Logs
Developing a strong metrics foundation at Coinbase helped alleviate some of the problems we experienced with Elasticsearch, as many workloads naturally migrated to Datadog. Now, at least when an issue did occur on the monolithic cluster, engineers had data they could fall back on.
But the Elasticsearch issues and outages continued. As engineering headcount grew, logging remained a source of frustration: in Q4 2018 alone there were roughly seven incidents that impeded engineering productivity. Each incident required operational engineers to step through elaborate runbooks to shut down dependent services, fully restart the cluster, and backfill data once the cluster had stabilized.
The root cause of each incident was opaque: was it a large aggregation query by an engineer? A security service gone rogue? The source of our frustration, however, was clear: we'd jammed so many use cases into this single Elasticsearch cluster that operating and diagnosing it had become a nightmare. We needed to separate our workloads to speed incident diagnosis and reduce the impact of failures when they did occur.
Functionally sharding the cluster by use case seemed like a great next step. We just needed to decide between investing further in the elaborate automation we'd put in place to manage our existing cluster and revisiting a managed solution to handle our log data.
We chose the latter and reevaluated managed solutions for our log data. While we'd previously decided against Amazon Elasticsearch Service due to what we considered at the time to be a limited feature set and stories of questionable reliability, we found ourselves intrigued by its simplicity, approved vendor status, and integration with the AWS ecosystem.
We used our existing codification framework to launch several new clusters. Since we leverage AWS Kinesis consumers to write log entries to Elasticsearch, simply launching duplicate consumers pointed at the newly launched clusters allowed us to quickly evaluate the performance of Amazon Elasticsearch Service against our heaviest workloads.
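As a rough sketch of that duplicate-consumer approach, assuming a Python consumer built on boto3 and the elasticsearch client (the stream, index, and endpoint names are placeholders rather than our actual configuration, and checkpointing, retries, and request signing are omitted):

```python
import json

import boto3
from elasticsearch import Elasticsearch, helpers

# Hypothetical endpoints: the existing self-managed cluster and the Amazon
# Elasticsearch Service cluster under evaluation.
CLUSTERS = [
    Elasticsearch("https://logs-monolith.cb-internal.fakenet:9200"),
    Elasticsearch("https://vpc-logs-eval.us-east-1.es.amazonaws.com:443"),
]

kinesis = boto3.client("kinesis", region_name="us-east-1")


def consume_shard(stream_name: str, shard_id: str) -> None:
    """Read log entries from one Kinesis shard and index them into every cluster."""
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="LATEST",
    )["ShardIterator"]

    while iterator:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=500)
        actions = [
            {"_index": "logs-write", "_source": json.loads(record["Data"])}
            for record in resp["Records"]
        ]
        if actions:
            # The same batch goes to both the old and the candidate cluster,
            # so their behavior can be compared under identical write load.
            for cluster in CLUSTERS:
                helpers.bulk(cluster, actions)
        iterator = resp.get("NextShardIterator")
```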
Our evaluation of Amazon Elasticsearch Service went smoothly, indicating that the product had matured significantly over the past two years. Compared to our previous evaluation, we were happy to see the addition of instance storage, support for modern versions of Elasticsearch (only a minor version or two behind at most), and various other small improvements like instant IAM policy modification.
While our monolithic cluster relied heavily on X-Pack to provide authentication and permissions for Kibana, Amazon Elasticsearch Service relies on IAM to handle permissions at a very coarse level (no document- or index-level permissions here!). We were able to work around this lack of granularity by dividing the monolith into seven new clusters: four for the vanilla logs use case and three for our various security team use cases. Access to each cluster is controlled by a cleverly configured nginx proxy and our existing internal SSO service.
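Because access is enforced with IAM rather than X-Pack, anything that talks to an Amazon Elasticsearch Service domain directly has to sign its requests with SigV4. Here's a minimal sketch of that wiring, assuming the elasticsearch-py client with requests-aws4auth; the domain endpoint and region are placeholders:

```python
import boto3
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

region = "us-east-1"
credentials = boto3.Session().get_credentials()

# SigV4-sign every request with the caller's IAM credentials; the attached IAM
# policy is what grants (coarse, cluster-wide) access to the domain.
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    region,
    "es",
    session_token=credentials.token,
)

es = Elasticsearch(
    hosts=[{"host": "vpc-logs-apps-1.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

print(es.cluster.health())  # e.g. confirm the domain is reachable
```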
“Kibana Selector” — our way of directing engineers to the appropriate functionally sharded cluster.
Migrating a team of over 200 engineers from a single, easy-to-find Kibana instance (kibana.cb-internal.fakenet) to several separate Kibana instances (one for each of our workloads) presented a usability challenge. Our solution was to point a new wildcard domain (*.kibana.cb-internal.fakenet) at our nginx proxy and use a project's Github organization to direct engineers to the appropriate Kibana instance. This way we can point several smaller organizations at the same Elasticsearch cluster, with the option to split them out as their usage grows.
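The routing decision itself is simple. Here's a simplified sketch with hypothetical organization names and cluster endpoints; in practice the mapping lives in the nginx proxy described above rather than in application code:

```python
# Several smaller organizations can share one cluster and be split out later
# as their usage grows; these names and endpoints are purely illustrative.
ORG_TO_CLUSTER = {
    "payments": "https://vpc-logs-apps-1.us-east-1.es.amazonaws.com",
    "identity": "https://vpc-logs-apps-1.us-east-1.es.amazonaws.com",
    "infra": "https://vpc-logs-infra.us-east-1.es.amazonaws.com",
    "security": "https://vpc-logs-security-1.us-east-1.es.amazonaws.com",
}


def upstream_for(host: str) -> str:
    """Map e.g. 'payments.kibana.cb-internal.fakenet' to its backing cluster."""
    org = host.split(".", 1)[0]
    if org not in ORG_TO_CLUSTER:
        raise LookupError(f"no Kibana cluster configured for organization {org!r}")
    return ORG_TO_CLUSTER[org]
```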
Our current log pipeline architecture overview.
Functionally sharding Elasticsearch has not only had a massive impact on the reliability of our logging pipeline, but has also dramatically reduced the cognitive overhead required to manage the system. In the end, we're thrilled to hand over the task of managing runbooks, tooling, and a fickle Elasticsearch cluster to AWS so that we can focus on building the next generation of observability tooling at Coinbase.
If these types of challenges sound interesting to you, we're hiring Reliability Engineers in San Francisco and Chicago (see our talk at re:Invent about the types of problems the Reliability team is solving at Coinbase). We have many other positions available in San Francisco, Chicago, New York, and London; visit our careers page at http://coinbase.com/careers to see if any positions spark your interest!