TL;DR: Coinbase has developed and deployed a machine learning model that predicts spikes in user traffic and automatically scales databases, preventing downtime and increasing platform efficiency.
By Gilad Buchman, Rubbal Sidhu, Indra Rustandi, Kyran Adams, Minu Jung, Prateek Gupta, Roman Burakov
Engineering, August 26, 2024
Crypto markets can be volatile. On the Coinbase platform, spikes of user activity and traffic can occur suddenly and disappear just as fast. We handle those changing traffic patterns and workloads by scaling up and adding resources in times of high traffic, and scaling back down to normal after the spike has passed. Scaling up, though, is not an instant process: starting to scale when traffic is already high is often too late. Therefore, we developed an automatic scaling solution that uses machine learning (ML) to predict traffic spikes and trigger a scale-up before the traffic arrives.
This solution served us well during a recent volatile market period when traffic suddenly rose significantly above normal levels. The graph below shows actual user traffic against scaled-up database levels over a two-week period.
As traffic increased so did our scale target, doubling twice a few hours before peak traffic. The model continued scaling up and down with the daily usage pattern, until volatility decreased and there was no longer a need to scale up.
Most web services can scale up by adding more machines, a process known as horizontal scaling. This process is not practical for our databases, however, and each of the methods for scaling a database has its own challenges:
Horizontal scaling: Adding nodes/replicas is a slower process, because the new instances need to restore from an existing snapshot, which can take upwards of two hours. Coinbase also uses a single-writer-per-shard architecture that cannot scale writes by adding more nodes.
Sharding: Coinbase uses sharding wherever appropriate, but we cannot quickly increase the number of shards in response to a traffic spike, as resharding is an expensive and slow operation.
Vertical scaling: Scaling instance size causes a temporary drop in capacity, as the nodes need to be restarted one at a time. However, it is a more efficient process, because the existing disk can be detached from the old instance and re-attached to the new scaled-up instance, making it a viable option.
But even with vertical scaling, when do we start scaling? The usual solution is to look at "CPU workload" and scale up when it crosses some threshold. However, CPU is a "lagging" indicator, meaning it goes up only after the load on the database has already increased. That is why we looked for a predictive model that provides a signal to scale up with enough lead time before a traffic spike, allowing our databases to scale up in time and prevent outages and downtime. This also helps improve efficiency by removing the need for our systems to be unnecessarily pre-scaled for high traffic.
The task for the predictive model is to provide a signal with 60 minutes of lead time before a traffic spike happens. To be effective, the model should be accurate and not miss any traffic spikes.
In our early work on traffic modeling, we tried to predict upcoming spikes with a time series forecasting model that predicted what the traffic level would be 60 minutes into the future. This model took into account real-time indicators of system load and traffic patterns. After extensive modeling and analysis, we concluded that there wasn't enough time lag in the underlying statistics to make this a viable approach. Simply put, by the time our systems began to notice the start of a spike, it was too late to react.
Instead of relying on a short-term time series forecasting model, we transformed the problem into a longer-term classification one. In addition to load on our platform (adjusted for periodic and seasonal changes), the new model leverages external signals such as price fluctuations in major cryptocurrencies like Bitcoin, Ethereum, and others. The model tries to answer the question of whether traffic will exceed a certain threshold level in the next few hours. This approach has worked significantly better and increased our accuracy.
The key insight: if cryptocurrency price volatility is high and the current traffic is approaching the target level at a faster rate than anticipated, then the likelihood of a traffic spike is increased.
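To make the classification framing concrete, here is a minimal sketch of how training labels could be derived from a historical traffic series: each time step is labeled 1 if traffic exceeds a threshold at any point within a look-ahead horizon. The function name `spike_labels`, the sample numbers, the threshold, and the horizon are all illustrative assumptions, not the production pipeline.

```python
from typing import List


def spike_labels(traffic: List[float], threshold: float, horizon: int) -> List[int]:
    """Label step t with 1 if traffic exceeds `threshold` at any of the
    next `horizon` steps (t+1 .. t+horizon), else 0."""
    labels = []
    for t in range(len(traffic)):
        window = traffic[t + 1 : t + 1 + horizon]
        labels.append(1 if any(v > threshold for v in window) else 0)
    return labels


# Hypothetical traffic readings, one per time step.
traffic = [100, 110, 105, 120, 300, 280, 130, 115]
labels = spike_labels(traffic, threshold=250, horizon=3)
print(labels)  # [0, 1, 1, 1, 1, 0, 0, 0]
```

Labels turn positive several steps before the spike itself, which is what gives a classifier trained on them its lead time. Inputs such as price volatility and the rate of traffic growth would be features paired with these labels.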
We tested the accuracy of our traffic spike classification model on historical data to make sure it met our needs and worked well in changing market conditions and with varying levels of traffic. There are two ways in which the model can be wrong. The first is missing a traffic spike and not alerting in advance. This is very costly, since it could lead to service unavailability. The second is alerting too often, even when there is no traffic spike. This causes unnecessary scaling up, which is also costly. The model was tuned to avoid any mistakes of the first kind and to minimize mistakes of the second kind. This tradeoff is acceptable, as it allows us to run at a lower baseline capacity that can handle day-to-day traffic and to scale up only some percentage of the time, providing both reliability and efficiency.
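One simple way to express this tradeoff, assuming the model outputs a spike score per time step, is to pick the highest alert threshold that still catches every historical spike and then count the false alarms that remain. This is a sketch of the idea only. The scores, the labels, and the `pick_threshold` helper are hypothetical, and the actual tuning procedure is not described in this post.

```python
from typing import List, Tuple


def pick_threshold(scores: List[float], labels: List[int]) -> Tuple[float, int]:
    """Return the highest threshold with zero missed spikes, plus the number
    of false alarms it produces. An alert fires when score >= threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("need at least one historical spike to tune against")
    # Any threshold above the lowest-scoring true spike would miss that spike.
    threshold = min(positives)
    false_alarms = sum(
        1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
    )
    return threshold, false_alarms


# Hypothetical model scores and ground-truth spike labels.
scores = [0.10, 0.35, 0.80, 0.55, 0.20, 0.60, 0.90]
labels = [0, 0, 1, 1, 0, 0, 1]
threshold, false_alarms = pick_threshold(scores, labels)
print(threshold, false_alarms)  # 0.55 1
```

Here the one non-spike step scoring 0.60 triggers an unnecessary scale-up, which is the price paid for never missing a real spike.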
We then converted the predictions of the classification model into a single metric called the "scale target", which indicates the level of traffic our infrastructure should be able to handle at that moment. This helps ensure we have enough capacity for expected fluctuations before a potential spike occurs.
How it works:
When the classification model predicts a spike above the current scale target (which is always higher than current traffic level), we escalate the scale target up to 2x its current level.
When the model’s predicted level stays below the scale target for six hours, we then de-escalate back to a lower scale target.
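The two rules above can be sketched as a small state machine that is advanced once per minute. The six-hour window and the 2x factor come from the text. Everything else is an assumption made for illustration: the `ScaleTarget` class, always taking the full 2x step, halving on de-escalation, and never dropping below a baseline.

```python
from dataclasses import dataclass

DEESCALATE_AFTER_MIN = 6 * 60  # six hours, as described in the text


@dataclass
class ScaleTarget:
    baseline: float         # assumed floor for the scale target
    target: float           # current scale target
    minutes_below: int = 0  # consecutive minutes the prediction stayed below target

    def update(self, predicted_level: float) -> float:
        """Advance one minute given the model's predicted traffic level."""
        if predicted_level > self.target:
            self.target *= 2  # assumption: always take the full 2x step
            self.minutes_below = 0
        else:
            self.minutes_below += 1
            if (self.minutes_below >= DEESCALATE_AFTER_MIN
                    and self.target > self.baseline):
                # Assumption: step down by half, never below the baseline.
                self.target = max(self.baseline, self.target / 2)
                self.minutes_below = 0
        return self.target


state = ScaleTarget(baseline=1000.0, target=1000.0)
print(state.update(1500.0))  # 2000.0: predicted spike above target, escalate
```

Resetting the counter on every escalation means the target only steps down after six uninterrupted quiet hours.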
The scale target metric is monitored by the auto-scaler module, which polls the current traffic information and the scale target every minute. The scale target tells the auto-scaler what level of traffic our databases should be able to handle at that moment, and the auto-scaler uses it to decide on the next action. If the scale target output by the model is higher than the current capacity, the auto-scaler scales up the databases to match it. The auto-scaler is also responsible for scaling down the clusters when it is safe to do so, based on factors such as last-hour traffic volume and time of day.
A separate machine learning model is used to estimate how much capacity is needed for different traffic levels. This is done on a weekly basis by looking at load testing data. We run frequent company-wide load tests, which give us an indication of how a cluster performs with an increase in traffic. The capacity planning job is a linear regression on traffic and several database metrics, with CPU and IOPS being the most important. It also factors in performance from the previous month.
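The decision made on each poll could look roughly like the pure function below. The scale-up rule follows the description above. The scale-down safety check is a guess at what "last-hour traffic volume and time of day" might mean in practice: `headroom`, `quiet_hours`, and the `decide` function itself are hypothetical names and values.

```python
from enum import Enum


class Action(Enum):
    SCALE_UP = "scale_up"
    SCALE_DOWN = "scale_down"
    HOLD = "hold"


def decide(scale_target: float, capacity: float, last_hour_peak: float,
           hour_of_day: int, headroom: float = 1.5,
           quiet_hours: range = range(2, 6)) -> Action:
    """Choose the next action for one polling cycle."""
    if scale_target > capacity:
        return Action.SCALE_UP
    # Assumed safety rules for scaling down: we are over-provisioned, recent
    # traffic sits well below the target, and it is a quiet time of day.
    over_provisioned = capacity > scale_target
    traffic_is_low = last_hour_peak * headroom <= scale_target
    is_quiet = hour_of_day in quiet_hours
    if over_provisioned and traffic_is_low and is_quiet:
        return Action.SCALE_DOWN
    return Action.HOLD


print(decide(scale_target=2000, capacity=1000, last_hour_peak=900, hour_of_day=14))
# Action.SCALE_UP
```

Keeping the decision free of side effects makes it easy to replay against historical traffic before letting it drive real scaling operations.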
In the example below, we analyze the linear regression of our users cluster and select the smallest database instance that meets our CPU and IOPS requirements.
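As a rough illustration of this kind of capacity planning, the sketch below fits one straight line per metric to load-test data, extrapolates CPU and IOPS to a target traffic level, and picks the smallest instance that fits. The load-test numbers, the instance catalog, the 75% utilization cap, and the function names are all hypothetical. The real job regresses on several database metrics and also factors in the previous month's performance.

```python
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def pick_instance(catalog: List[Tuple[str, float, float]], cpu_needed: float,
                  iops_needed: float, max_utilization: float = 0.75) -> Optional[str]:
    """Return the first (smallest) instance whose capped CPU and IOPS both fit.
    `catalog` must be ordered from smallest to largest."""
    for name, cpu, iops in catalog:
        if cpu * max_utilization >= cpu_needed and iops * max_utilization >= iops_needed:
            return name
    return None


# Hypothetical load-test observations: traffic (requests/s) vs. resources used.
traffic = [1000.0, 2000.0, 3000.0, 4000.0]
cpu_used = [4.0, 8.0, 12.0, 16.0]             # cores
iops_used = [2000.0, 4000.0, 6000.0, 8000.0]

cpu_slope, cpu_icpt = fit_line(traffic, cpu_used)
iops_slope, iops_icpt = fit_line(traffic, iops_used)

target_traffic = 5000.0  # e.g. the current scale target
cpu_needed = cpu_slope * target_traffic + cpu_icpt
iops_needed = iops_slope * target_traffic + iops_icpt

# Hypothetical instance catalog: (name, cores, IOPS), smallest first.
catalog = [("small", 16, 10000), ("medium", 32, 20000), ("large", 64, 40000)]
choice = pick_instance(catalog, cpu_needed, iops_needed)
print(choice)  # medium
```

With these made-up numbers the fit predicts about 20 cores and 10,000 IOPS at the target, so "small" is too small and "medium" is the smallest instance that stays under the utilization cap.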
Coinbase developed a machine learning model to predict traffic spikes and automatically scale databases to prevent downtime and increase efficiency. This predictive scaling approach allows us to optimize infrastructure costs by avoiding over-provisioning while ensuring our platform remains reliable during unpredictable crypto market movements. As a next step, we plan to apply this new capability of predicting traffic spikes to other infrastructure challenges and to provide an even more reliable and resilient system to our customers.
Special thanks to the Coinbase engineering team for their contributions to this blog.
Indra Rustandi, Staff Machine Learning Engineer, Kyran Adams, Senior Machine Learning Engineer, Minu Jung, Senior Machine Learning Engineer, Prateek Gupta, Senior Engineering Manager, and Roman Burakov, Machine Learning Engineer
About Gilad Buchman and Rubbal Sidhu
Gilad Buchman is an Engineering Manager and AI/Machine Learning lead based in Israel. With over 15 years in tech, Gilad specializes in machine learning, recommendation systems, and distributed systems. Gilad leads the Machine Learning Recommendations team at Coinbase, focusing on personalization and large language model applications. Previously, he spent nearly 8 years at X, formerly Twitter, as machine learning Engineering Manager of the Explore and Trends team. Gilad holds an MS in Computer Science from the University of Pennsylvania.
Rubbal Sidhu is a staff software engineer with the Infrastructure team at Coinbase.