

Boosting App Performance: Strategies to Optimize Network Requests

TL;DR: Coinbase optimizes network request patterns via architecture improvements, peak load optimizations, and guardrails to scale systems and onboard the next billion crypto users.

By Manjiri Moghe, Engineering

December 18, 2024

The growing popularity of cryptocurrency trading has led to a substantial increase in adoption. This increased adoption can lead to significant traffic surges to exchanges like Coinbase, with cryptocurrency prices fluctuating dramatically within minutes. As our user base and transaction volume continue to grow, ensuring our systems can handle peak loads without compromising availability or performance is critical. We have implemented several measures to achieve this:

  • Auto-Scaling Policies for database resources: dynamically scale with demand, preventing downtime and increasing platform efficiency.

  • Caching: deploying caching strategies across our APIs and frontends, backed by a CDN, significantly reduces load times and improves the user experience.

  • Load Testing: identifying and mitigating bottlenecks before they affect users ensures smooth performance during peak times.

While these measures are crucial for maintaining system resilience and performance, further enhancing scalability requires continually optimizing network request patterns in our mobile and web applications to minimize redundant API calls and improve overall efficiency. This approach ensures full-stack efficiency and reinforces our dedication to building and maintaining customer trust through reliable and high-performing services.

Why optimizing the number of network requests matters

Optimizing the number of network requests offers significant advantages for both end users and Coinbase.

For end users, this approach leads to faster load times, allowing for quicker execution of transactions. By minimizing data transfer, performance is improved for lower-end devices, ensuring a seamless and consistent user experience. This results in smoother interactions throughout the application, thereby greatly enhancing overall usability and customer satisfaction.

For Coinbase, reducing network requests translates to a decreased server load, which alleviates strain on the infrastructure. This improvement allows the system to handle a higher volume of requests more efficiently, facilitating better scalability. Consequently, this leads to more cost-effective operations, as the expense associated with scaling infrastructure is significantly reduced.

*It is important to note that a higher number of requests is not inherently bad; it is the unnecessary requests that need to be eliminated.

Architecture Overview

To understand Coinbase’s approach to improving the network traffic pattern, let’s review the high-level architecture.

[Figure: high-level architecture overview]

Coinbase uses Relay and GraphQL as our data layer. A client-side GraphQL query calls various resolvers (think of a resolver as a function that aggregates data by calling various API endpoints) to collect the data the client needs to render a screen or page. A single request originating from a client and passing through our gateway and GraphQL layer can generate anywhere from 1 to 100+ upstream requests, depending on the data being requested. It is vital to pay attention to the exact number of requests client applications make on cold start, on initial load, and across our critical user journeys. We started exploring this topic as part of our load testing workstream to optimize load during high traffic, but soon realized there were opportunities to improve our request pattern even outside of high-traffic periods.
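
To make the fan-out concrete, here is a minimal sketch of a query whose resolver aggregates data from several upstream endpoints. Apollo Server, the schema, and the service URLs are illustrative placeholders, not Coinbase's actual GraphQL layer or services.

```typescript
// Minimal sketch of how a single client query fans out into many upstream
// requests. Apollo Server and the service URLs are illustrative placeholders.
import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';

const typeDefs = `#graphql
  type Asset {
    id: ID!
    name: String!
    price: Float!
    balance: Float
  }

  type Query {
    portfolio: [Asset!]!
  }
`;

// Tiny helper standing in for real upstream service clients.
async function fetchJson(url: string): Promise<any> {
  const res = await fetch(url);
  return res.json();
}

const resolvers = {
  Query: {
    // One client query triggers one request for the owned-asset list, then
    // one request per asset for details and one per asset for balances.
    portfolio: async () => {
      const ownedIds: string[] = await fetchJson('https://assets.example.com/owned');
      return Promise.all(
        ownedIds.map(async (id) => {
          const info = await fetchJson(`https://assets.example.com/assets/${id}`);
          const balance = await fetchJson(`https://balances.example.com/${id}`);
          return { ...info, balance };
        })
      );
    },
  },
};

const server = new ApolloServer({ typeDefs, resolvers });
// A 50-asset portfolio would produce 1 + 50 + 50 = 101 upstream requests here.
startStandaloneServer(server, { listen: { port: 4000 } }).then(({ url }) =>
  console.log(`GraphQL ready at ${url}`)
);
```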

Categorizing Improvements

We identified three main themes for categorizing our improvements:

  1. Limiting over-fetching.

  2. High traffic optimization.

  3. Establishing guardrails.

Let’s dive deeper into each one below:

1. Limiting over-fetching

Auditing request patterns regularly, and especially when adding new code, is crucial. This helps measure the impact on client performance, upstream services, and databases, and it identifies cases of over-fetching before it's too late. Common pitfalls that lead to over-fetching are inefficient API design, polling, requesting unnecessary data, and deprecated code that is never cleaned up.

Inefficient API design

As a trading platform, Coinbase naturally has a lot of asset information and lists of assets shown on various product surfaces across web and mobile apps with different contexts such as tradable, sendable, and owned. This data is sourced from multiple services, aggregated at our GraphQL layer, and then exposed to the client with the help of the resolver function.

For example: to display your detailed portfolio, 2-3 services are called:

  • Service 1: Fetch asset information such as the asset name, price, and image.

  • Service 2: Fetch the list of assets owned by the user.

  • Service 3: Fetch the balance for assets owned by the user.

The above is an N+1 query problem. It may seem faster to execute multiple smaller queries rather than one large, complex query, but that is rarely the case: every individual request corresponds to a database request, consuming server time and resources. We audited our most critical user journeys and tier 0 services to identify opportunities to fix this N+1 problem and brought down our request counts by almost 90% in some cases. Solutions included:

  • Update APIs to accept multiple parameters in one request.

  • Use data loaders on resolvers (see the sketch after this list).

  • Add pagination support. 
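
Of these, data loaders are worth a concrete illustration. Below is a minimal sketch using the open-source dataloader package against a hypothetical batch endpoint; Coinbase's actual resolvers and services will differ.

```typescript
// Minimal sketch: batching per-asset lookups with the open-source `dataloader`
// package so N individual resolver calls collapse into a single upstream
// request. The batch endpoint below is a hypothetical placeholder.
import DataLoader from 'dataloader';

interface AssetInfo {
  id: string;
  name: string;
  price: number;
}

// Batch function: receives every asset id requested in one tick of the event
// loop and resolves them all with a single upstream call.
const assetLoader = new DataLoader<string, AssetInfo>(async (ids) => {
  const res = await fetch(`https://assets.example.com/assets?ids=${ids.join(',')}`);
  const byId: Record<string, AssetInfo> = await res.json();
  // DataLoader requires results in the same order as the requested keys.
  return ids.map((id) => byId[id]);
});

// Field resolver: each call asks the loader for one id; DataLoader coalesces
// concurrent calls, turning an N+1 pattern into a single batched request.
export function resolveAsset(id: string): Promise<AssetInfo> {
  return assetLoader.load(id);
}
```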

Polling

Polling is one of the easiest ways to notify users of an immediate change. However, when implemented incorrectly, it can hurt app performance and waste valuable infrastructure resources especially if the data changes infrequently. There are a few things to be mindful of when implementing polling:

  • The frequency of polling.

  • Unsubscribing from polling when the user moves away from a screen.

Let’s take a closer look at unsubscribing from polling. The Coinbase mobile app uses React Native. In React Native applications, navigation stacks work like this: older screens remain mounted even as you go deeper into a stack (or flow of screens). If one of those screens in the stack implements a poller, it will continue to poll data unless it is explicitly cleared when the screen is not focused. Here’s an example:

[Figure: navigation stack with a poller on the Home screen]

This root stack has three screens, with the Home screen implementing a poller that fetches up-to-date balance information. Let’s assume it polls every 2 seconds and makes 40 upstream requests per poll. The user navigates to the “View Accounts” screen and then the “View credit card balance” screen, and stays on the “View credit card balance” screen for 1 minute. If the poller is not unsubscribed when the Home screen is out of focus, it will continue to fetch balance data even though the user is not on the Home screen, making 40 upstream requests every 2 seconds and resulting in 1,200 upstream requests per minute in this session. These 1,200 requests are unnecessary and easily avoidable by unsubscribing the poller when the Home screen is not focused, saving latency and infrastructure costs.
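
Here is a minimal sketch of the fix, assuming the app’s navigation is built on react-navigation (which provides the useFocusEffect hook); the 2-second interval and the balance endpoint are illustrative placeholders.

```tsx
// Minimal sketch: polling only while the Home screen is focused, assuming
// react-navigation's useFocusEffect. The balance endpoint is a placeholder.
import React, { useCallback, useState } from 'react';
import { Text } from 'react-native';
import { useFocusEffect } from '@react-navigation/native';

// Hypothetical helper standing in for the real balance query.
async function fetchBalances(): Promise<string> {
  const res = await fetch('https://example.com/api/balances');
  return res.text();
}

export function HomeScreen() {
  const [balances, setBalances] = useState('loading...');

  useFocusEffect(
    useCallback(() => {
      // Start polling only while this screen is focused.
      const id = setInterval(async () => {
        setBalances(await fetchBalances());
      }, 2000);

      // The cleanup runs when the screen loses focus or unmounts, so no
      // upstream requests are issued while the user is on other screens.
      return () => clearInterval(id);
    }, [])
  );

  return <Text>{balances}</Text>;
}
```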

  • Challenge the need for polling: Engineers often default to polling to provide users with accurate and up-to-date information. However, it is essential to evaluate whether polling is necessary, especially if the data doesn’t change frequently. If there is a screen in the app that displays a list of the user’s assets with balances, and asset prices update frequently, polling may seem like a good option to show users their correct balance at all times. However, there are a few questions to ask before implementing polling here:

  1. User Behavior Analysis

    1. Question: How many seconds does a user typically spend on this screen?

    2. Insight: If users spend less than a few seconds on this screen, they might not notice the balance updates. Frequent polling might be unnecessary and can lead to inefficiency.

  2. Infrastructure Impact

    1. Question: What is the impact of polling on upstream services and infrastructure?

    2. Insight: Assess the number of additional requests generated by polling, especially during peak times. This can significantly affect upstream service performance and infrastructure stability.

  3. User Experience (UX) Enhancement

    1. Question: Does polling significantly improve the user experience to justify its impact on upstream services?

    2. Insight: Evaluate if the real-time balance update enhances the user experience enough to outweigh the negative impacts of increased load on the services.

  4. Alternatives to Polling

    1. Question: Is there an alternative to polling, such as using subscriptions or webhooks?

    2. Insight: Explore alternatives like WebSockets, push notifications, or event-driven architectures (e.g., subscribing to updates). These alternatives can provide real-time updates without the need for constant polling, thus reducing the load on infrastructure (see the sketch below).
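
For illustration, here is a minimal sketch of a push-based alternative: subscribing to balance updates over a WebSocket instead of polling. The endpoint and message shape are hypothetical.

```typescript
// Minimal sketch: a push-based subscription instead of polling. The endpoint
// URL and message format are hypothetical placeholders.
type BalanceUpdate = { assetId: string; balance: number };

export function subscribeToBalances(
  onUpdate: (update: BalanceUpdate) => void
): () => void {
  const socket = new WebSocket('wss://stream.example.com/balances');

  // The server pushes a message only when a balance actually changes, so the
  // client issues no recurring requests at all.
  socket.onmessage = (event) => {
    onUpdate(JSON.parse(event.data) as BalanceUpdate);
  };

  // Return an unsubscribe function so the caller can close the connection
  // when the screen is no longer visible.
  return () => socket.close();
}

// Usage: subscribe when the screen mounts, unsubscribe when it unmounts.
const unsubscribe = subscribeToBalances((u) =>
  console.log(`Asset ${u.assetId} balance is now ${u.balance}`)
);
// Later, e.g. on unmount: unsubscribe();
```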

Requesting unnecessary data

At times, we inadvertently over-fetch data, not realizing it could be cached more efficiently or that it may not be relevant for the user type. Effective caching is essential for any website or mobile app to ensure faster load times and alleviate server load. Here are some common strategies to avoid over-fetching:

  • Set appropriate TTL (Time to Live) values.

  • Define fetch policies and directives (specific to GraphQL).

  • Cache data at the edge, utilizing CDNs.

  • Cache common data models, such as asset information and user information, on the client side.

  • Add eligibility checks for product features.

  • Implement memoization for React components wherever possible.

Let’s look at some examples of requesting unnecessary data:

Example 1: There may be a feature in the app that only UK users are eligible for, but we do not fetch its data conditionally. Checking the user's country before fetching the data helps avoid making unnecessary requests.

GraphQL provides directives like @skip and @include that help with requesting data conditionally.
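
For instance, a client could gate a region-specific field on a boolean variable. The query below is an illustrative sketch; the field and variable names are not from Coinbase's actual schema.

```typescript
// Minimal sketch: conditionally requesting a region-gated field with GraphQL's
// built-in @include directive. Field and variable names are illustrative.
import gql from 'graphql-tag';

export const HomeQuery = gql`
  query Home($includeUkFeature: Boolean!) {
    viewer {
      id
      assets {
        id
        name
      }
      # Only fetched when the client already knows the user is eligible.
      ukOnlyPromotions @include(if: $includeUkFeature) {
        title
      }
    }
  }
`;

// The variable comes from a cheap local eligibility check (e.g. cached profile
// data), so ineligible users never trigger the upstream fetch for this field.
const userCountry = 'GB'; // hypothetical value read from the cached profile
export const variables = { includeUkFeature: userCountry === 'GB' };
```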

Example 2: There are certain pieces of information that do not change often, like the user's profile information (name, country, native currency). On a financial app like Coinbase, country and currency information are very important, but fetching them on every screen is not ideal. Instead, a better caching/storage strategy can be implemented to reduce additional calls to backend systems and the underlying databases.
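
As one possible approach, here is a minimal sketch of a simple client-side TTL cache for this kind of rarely-changing data; the endpoint and TTL value are illustrative.

```typescript
// Minimal sketch: a client-side TTL cache for rarely-changing data such as
// profile information. The endpoint and TTL below are illustrative.
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

const cache = new Map<string, CacheEntry<unknown>>();

export async function cachedFetch<T>(url: string, ttlMs: number): Promise<T> {
  const hit = cache.get(url) as CacheEntry<T> | undefined;
  if (hit && hit.expiresAt > Date.now()) {
    // Served from cache: no network request, no load on backend systems.
    return hit.value;
  }
  const res = await fetch(url);
  const value: T = await res.json();
  cache.set(url, { value, expiresAt: Date.now() + ttlMs });
  return value;
}

// Usage: profile data changes rarely, so a long TTL avoids refetching it on
// every screen while staying reasonably fresh, e.g. inside any async function:
// const profile = await cachedFetch('https://example.com/api/profile', 60 * 60 * 1000);
```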

Example 3: We display a list of assets in multiple places across our applications, but we rarely need to show ~300 assets at once. In such cases, proper pagination can break queries into pages and avoid fetching all the asset data at once, reducing load on backend systems and improving request latency.
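
Here is a minimal sketch of a cursor-based paginated query in the Relay connection style; the field names and page size are illustrative, not Coinbase's actual schema.

```typescript
// Minimal sketch: a cursor-based paginated asset query in the Relay connection
// style, so clients fetch a page at a time instead of all ~300 assets at once.
// Field names and page size are illustrative.
import gql from 'graphql-tag';

export const AssetListQuery = gql`
  query AssetList($first: Int!, $after: String) {
    assets(first: $first, after: $after) {
      edges {
        node {
          id
          name
          price
        }
      }
      pageInfo {
        endCursor
        hasNextPage
      }
    }
  }
`;

// First page: 20 assets. To load more, pass pageInfo.endCursor as "after", so
// backend systems only read the rows the user actually scrolls to.
export const firstPageVariables = { first: 20, after: null };
```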

Deprecated code cleanup

We have implemented various static code analyzers to identify unused modules, low-traffic experiences, and resolved A/B tests, ensuring dead code is swiftly removed from production and unnecessary requests to backend systems are avoided. Taking A/B tests as an example: once a test concludes, it's crucial to clean up the code associated with the unused paths, as the data for both sides of the experiment may still be fetched, duplicating data lookups.

2. High traffic optimization

Coinbase must guarantee that our users can access the application seamlessly, without any disruptions, during significant price movements. We built a system that predicts traffic surges. We monitor the health of critical services and databases and trigger a few key optimizations. The main goal of these optimizations is to dynamically reduce non-critical traffic from our client applications via a configuration service API, which helps alleviate the load on our backend systems. This includes:

  • Turning off preloading of data (optimistic fetching).

  • Increasing cache TTL on the client side.

  • Optimizing retry logic on error screens to be less aggressive.

  • Disabling non-critical product features, like promotions, which are less likely to be visited during market volatility.

  • Launching lite versions of product features.

As a result of the above, we are able to reduce requests during high traffic by around 64%.
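
To make the mechanism concrete, here is a minimal sketch of a client consulting a remote configuration flag before doing non-critical work; the endpoint, flag names, and defaults are hypothetical.

```typescript
// Minimal sketch: the client consults a remote configuration flag before doing
// non-critical work. Endpoint, flag names, and defaults are hypothetical.
interface TrafficConfig {
  preloadEnabled: boolean; // optimistic prefetching of data
  cacheTtlMs: number;      // client-side cache TTL
  maxRetries: number;      // retry budget on error screens
}

const DEFAULTS: TrafficConfig = {
  preloadEnabled: true,
  cacheTtlMs: 30_000,
  maxRetries: 3,
};

async function loadTrafficConfig(): Promise<TrafficConfig> {
  try {
    const res = await fetch('https://config.example.com/client-traffic');
    return { ...DEFAULTS, ...(await res.json()) };
  } catch {
    // Fall back to normal behaviour if the config fetch itself fails.
    return DEFAULTS;
  }
}

export async function maybePreloadPortfolio(): Promise<void> {
  const config = await loadTrafficConfig();
  // During a predicted traffic surge the backend flips preloadEnabled off and
  // clients immediately stop issuing these optimistic requests.
  if (!config.preloadEnabled) return;
  await fetch('https://example.com/api/portfolio/preload');
}
```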

3. Establishing guardrails

Maintaining system integrity at Coinbase is paramount, and our guardrails play a crucial role in this effort. These guardrails are designed to detect regressions early and engage the relevant teams for timely resolution. By doing so, they not only maintain but also elevate our quality standards, resulting in a performant and highly scalable application.

Our guardrails are strategically established at various stages of our architecture to detect request anti-patterns, including:

  1. Client to Gateway.

  2. GraphQL to Individual Services.

  3. Traffic to Critical Services, grouped by client queries and platforms.

To adhere to these guardrails and avoid violations, we continuously seek to introduce new metrics where we have observability gaps. These metrics are essential for enhancing our performance and monitoring our baseline numbers, thereby ensuring system stability and reliability.

Defining new metrics

Gathering the right data is crucial to making informed decisions and driving the strategy forward. Defining success metrics beforehand can help inform how we want to measure and improve our systems.

Example 1: Pinpointing network request regressions in our mobile app was difficult because app versions were previously not included in some key GraphQL metrics.

The addition of the app version enabled us to successfully identify regressions in newer app versions early in the development process, even before the production rollout. This proactive measure has prevented any potential negative impact on our external customers, ensuring a smoother and more reliable user experience.

Example 2: We added a metric to identify the cached percentage per GraphQL operation on the client side. By measuring cache percentages, we can identify opportunities for more aggressive client-side caching.

Monitoring

A combination of anomaly detection and threshold monitoring has helped us cover most of our use cases. Key monitoring areas include:

  • Number of requests on app initialization.

  • Number of requests per critical user journey (CUJ).

  • Number of requests per screen (see the sketch below).
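
As an illustration of the last item, here is a minimal sketch of attributing requests on the client to the screen that issued them; the screen names and reporting mechanism are hypothetical, and a real implementation would report to a metrics backend rather than the console.

```typescript
// Minimal sketch: attributing every network request to the screen that issued
// it, so monitors can alert when a screen exceeds its baseline. Screen names
// and the reporting mechanism are hypothetical.
const requestCounts = new Map<string, number>();
let currentScreen = 'Home';

// Called by navigation code whenever the visible screen changes.
export function setCurrentScreen(name: string): void {
  currentScreen = name;
}

// Wrap fetch so every request increments the counter for the current screen.
const originalFetch = globalThis.fetch;
globalThis.fetch = async (input, init) => {
  requestCounts.set(currentScreen, (requestCounts.get(currentScreen) ?? 0) + 1);
  return originalFetch(input, init);
};

// Periodically flush the counts so threshold and anomaly monitors can run
// against the resulting per-screen time series.
export function flushRequestMetrics(): void {
  for (const [screen, count] of requestCounts) {
    console.log(`requests_per_screen{screen="${screen}"} = ${count}`);
  }
  requestCounts.clear();
}
```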

Any regression to request patterns in these key areas has a direct business impact on Coinbase. Let’s do a quick walkthrough of how these monitors helped us catch a regression.

Example: A new field, which was still in the development phase, was added to one of our critical queries without being gated appropriately behind an A/B testing experiment. It resulted in more than 20 additional upstream fetches per query for all users.

  • The issue was caught in the development phase through our monitoring system which tracks the number of requests per screen.

  • The responsible team was involved in triaging the issue.

  • The issue was resolved by addressing the gating on the client. Additionally, the call pattern between services was optimized to prevent the 20+ upstream requests corresponding to a single field.

Since we started working on this at the beginning of the year, we have been able to reduce requests from our mobile and web apps by 64% during high traffic, by 30% on application initialization, and by 40% across our critical user journeys, which make up 20% of the traffic to our critical services, all without compromising the user experience or the quality of the applications. This also means our services can handle traffic more efficiently, resulting in reduced infrastructure costs and higher uptime.

It is important to continuously assess and optimize request patterns from feature conception to deployment. Challenging the number of requests helps uncover inefficiencies and streamline performance. Regular code cleanups are essential to maintaining a high-quality codebase, while monitoring traffic patterns for critical user journeys enables swift identification of regressions. Ensuring that client traffic is justified helps avoid unnecessary scaling, contributing to cost efficiency.

Promoting knowledge sharing and best practices across engineering teams is vital. Utilizing automation tools such as code editors, PR checks, and AI code reviewers enhances continuous improvement and fosters collaboration, ensuring that our systems remain robust and efficient.
