TLDR: Our recent evaluation of Kubernetes underscored its suitability for scaling Coinbase into the future. In the past, a migration to Kubernetes raised concerns due to the operational burden of running and securing the control plane in-house. We’ve now concluded that managed Kubernetes offerings reduce this operational burden without compromising our stack security.
Almost two years ago we released a blog post detailing why Kubernetes is not part of our technical stack. At the time, migrating to Kubernetes would have created a whole new set of problems that outweighed any near-term benefits. However, as these technologies have matured, our newly-formed Compute Team devised a strategy for leveraging Kubernetes in a way that can deliver a more flexible and scalable version of our current system.
Coinbase has grown substantially since we first considered migrating to Kubernetes. With any growth of this kind, it is important to prioritize scalability concerns. As we continue to scale, one of the main areas in need of future-proofing is Coinbase’s compute platform. In mid-2020, our largest service was configured to run on a relatively small number of hosts, whereas today it’s running on 10x that number.
In this same period, we quadrupled the size of our engineering organization, causing a substantial increase in the number of deployments, each needing completely new hosts. This increase has raised concerns over future scalability, as we are already running into technical limitations of current APIs and resources. Recurring issues with getting enough capacity and having it delivered in a reasonable timeframe caused an increase in failed deployments and required our largest services to dramatically slow down their release process.
While these issues are solvable, we decided to take this opportunity to evaluate whether it made sense to continue investing in a homegrown system or consider an open source alternative that would be much more scalable in the long term.
In our evaluation of Kubernetes, we found that one of the biggest advantages of a migration is that it decouples host provisioning from service deployment, moving the burden of managing host acquisition from individual teams to the broader Infrastructure team. This empowers the Infrastructure team to take a holistic approach to host management. Also, capacity constraints are less likely to affect deployments, and we reduce the amount of cloud provider specific knowledge that individual engineers need to maintain.
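To make the decoupling concrete, here is a minimal toy sketch, not our actual platform and not how the Kubernetes scheduler is implemented (the real scheduler filters and scores nodes across many dimensions). The `Host` class, the `schedule` function, and all names and numbers are hypothetical. The point it illustrates is that a deployment only asks for resources, while the pool of hosts is provisioned ahead of time by someone else.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Host:
    """A host that the infrastructure team provisioned ahead of time (hypothetical model)."""
    name: str
    cpu_free: float  # CPU cores still available on this host
    pods: List[str] = field(default_factory=list)


def schedule(pod_name: str, cpu_request: float, pool: List[Host]) -> Optional[str]:
    """Place a pod on the first host with enough free CPU (simple first-fit).

    Returns the chosen host name, or None when the shared pool has no room.
    The deploying team never acquires a host itself; it only states a request.
    """
    for host in pool:
        if host.cpu_free >= cpu_request:
            host.cpu_free -= cpu_request
            host.pods.append(pod_name)
            return host.name
    return None


# The pool is managed centrally and is independent of any single deployment.
pool = [Host("node-a", 4.0), Host("node-b", 4.0)]

# A service team deploys five replicas, each requesting 1.5 cores.
placements = [schedule(f"web-{i}", 1.5, pool) for i in range(5)]
print(placements)
```

In this sketch the fifth replica does not fit, so `schedule` returns `None`. In a setup like the one described above, that shortfall becomes a signal for the team that owns the pool to add capacity, rather than a failure that each service team has to resolve by acquiring hosts itself.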
The Kubernetes community has created a wealth of knowledge and tooling that we can utilize to provide better support to teams and quickly enable new features. Additionally, as Kubernetes is extensible, there is still the option to build tooling internally and open source it for use within the wider community.
Security is incredibly important at Coinbase, and securing Kubernetes clusters is a non-trivial undertaking. Transitioning from highly isolated, single-tenant compute to a system that promotes multi-tenancy requires deliberate security design and consideration. Because we have high-security workloads where we have to guarantee isolation, we must run separate clusters. And because individuals are not allowed access to operate high-security infrastructure, we must build automated tooling that handles all cluster operations.
Managed Kubernetes offerings, such as AWS EKS, take on the responsibility of operating, maintaining, and securing the control plane, reducing the operational burden of running many clusters. Reducing our operational burden and security responsibility enables us to focus on building the orchestration and automation required to support many clusters across a large engineering organization. EKS has matured significantly over the past few years and has shown that it provides a stable, production-ready Kubernetes while also integrating with features that are commonly used in EC2, such as attaching security groups to pods and IAM roles to service accounts. Those integrations reduce the risk and cost of migration, as they allow us to migrate without changing the identity or access patterns of our current platform.
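As an illustration of the IAM-roles-for-service-accounts integration mentioned above, the following sketch builds a Kubernetes ServiceAccount manifest carrying the `eks.amazonaws.com/role-arn` annotation that EKS uses to associate an IAM role with a service account. The account ID, role name, service name, and namespace are hypothetical placeholders, not values from our environment, and the helper function is only for illustration.

```python
import json

# Hypothetical role ARN for illustration only.
ROLE_ARN = "arn:aws:iam::111122223333:role/example-payments-role"


def service_account_manifest(name: str, namespace: str, role_arn: str) -> dict:
    """Build a ServiceAccount manifest annotated with an IAM role for EKS."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # EKS reads this annotation to map the service account to an IAM role.
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


manifest = service_account_manifest("payments", "payments", ROLE_ARN)

# Kubernetes accepts JSON manifests, so this output could be saved to a file
# and applied with kubectl.
print(json.dumps(manifest, indent=2))
```

Pods that run under this service account can then obtain credentials for the annotated role, provided the cluster has an IAM OIDC provider configured and the role's trust policy allows that service account. This is what lets a workload keep the same IAM identity it used on EC2.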
While a migration to Kubernetes raised concerns in the past, we’ve now concluded that managed Kubernetes offerings, such as AWS EKS, can reduce the operational burden without compromising security. Ultimately, we realized there is a clear ceiling on how far our homegrown system can scale, and while a move to Kubernetes carries a large setup and migration cost, we are confident that it will be more flexible and scalable than our current system.