Revamping the Apache Airflow-based workflow orchestration platform at Coinbase

TL;DR: In this blog post, we share how we revamped our Apache Airflow-based orchestration platform to improve security, operational efficiency, and development velocity. We also share our experience migrating pipelines and onboarding users to the revamped platform via crowdsourcing.

By The Coinbase Data Platform & Services Team

Engineering, November 1, 2022

Coinbase Blog

Apache Airflow was initially adopted in 2017 by the Coinbase Data Platform team to build ETL (extract, transform, load) pipelines. The initial iteration of our Airflow-based orchestration platform consisted of a single cluster deployment and a single Git repository. Over the past few years at Coinbase, our Airflow platform evolved into the de facto workflow orchestration platform and our Airflow user base quickly grew to include hundreds of engineers, data scientists and analysts across the company (see Figure 1). 

To fulfill our commitment to building the most trusted and secure crypto platform in the world, and to meet the demands of explosive growth in internal Airflow usage, we iterated on the platform several times. The result was an architecture consisting of multiple, independently evolving clusters and repositories. We created new clusters to achieve better isolation and address security and compliance concerns, and the proliferation of new pipelines prompted us to add common libraries and customizations to vanilla Apache Airflow. By 2021, we were maintaining separate Git repositories for deployment setup, common library development, and Apache Airflow customizations, on top of an individual repository for each of our clusters.

Figure 1. Total numbers of pipelines and monthly pull requests over time.

The combination of this fragmented architecture, an increasingly complicated development workflow, and continued rapid growth of Airflow usage had a multiplier effect: it created significant operational overhead for the Data Platform team and prevented us from iterating rapidly on the platform. Notably, we were stuck running Airflow 1.x, which had reached end of life (EOL). To get us out of this predicament, the Data Platform team decided to rebuild the platform with the following architectural improvements:

  • Keep Airflow Up-to-Date: we streamlined the process of installing and customizing different Airflow deployments to enable swift and regular Airflow version upgrades.

  • Embrace the Monorepo: we adopted a Python monorepo to unify developer experience, streamline CI/CD, and accelerate developer velocity.

  • Tailor the DevX: we developed tailored development environments to serve diverse user profiles and minimize support efforts.

Lastly, we knew that the speed of migration was of the essence. The longer both generations of the platform coexist, the more overhead the Data Platform team carries to maintain and support both, and the more often users have to switch environments. The sheer number of pipelines, and the tribal knowledge embedded in each, made it impossible for the core project team to complete the migration on its own. We tackled this problem from a cultural and organizational angle and adopted crowdsourcing to enable a rapid, large-scale migration.

Keep Airflow Up to Date

At Coinbase, Airflow processes highly sensitive payloads. It was imperative that we stay as close as possible to the latest stable Airflow version, with the latest security and bug fixes. When we started the architecture revamp in late 2021, we were still running Airflow 1.10, which reached EOL in June 2021. Airflow 2.0 had been released a year earlier, but we were unable to upgrade to the latest version due to the following challenges:

  • We were running a patched version of open source Airflow to address our requirements on security, observability, and internal features.

  • Proof-of-concept experimentation revealed that a large number of pipelines would be broken after upgrading the Airflow version to 2.0 due to incompatible changes in this major release. 

  • The complexities caused by customizations in different Airflow Clusters and independently evolving yet cross-referenced Git repositories made even a minor Airflow version upgrade a tedious and error-prone process that could easily cause production incidents.

To unblock the version upgrade, we took the following approaches to drastically simplify future version upgrades and enable us to “change the airplane engine while in flight” at a more regular cadence:

  • Airflow patching removal. Patching is a deceptively easy mechanism for adding customizations, but it accrues significant tech debt. We reimplemented all the patching-based customizations and enhancements using non-intrusive mechanisms such as Airflow Plugins, consolidating all configuration customizations into a configuration file, extending the BaseOperator, and more.

  • Codemod incompatibilities. We used Airflow 1.10.15, the “bridge release” between Airflow 1 and Airflow 2, to surface incompatibilities. We then implemented automated codemods to perform repo-wide refactors that fix incompatible code patterns. These codemods significantly reduced the effort of subsequently migrating individual pipelines.

  • Incremental upgrade. Instead of implementing a one-time upgrade in-place, we created new clusters from scratch to enable incremental migration of pipelines. This helped us minimize disruptions to production workload and gave individual pipeline owners enough breathing room to decide on a migration time frame that best fit their business requirements.
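
To make the codemod idea concrete, here is a minimal sketch of an import-rewriting codemod. It is an illustration only and not the internal tool: the function names are hypothetical, the rename table covers just three of the module moves introduced in Airflow 2.0, and a production codemod would more likely operate on a syntax tree than on regular expressions.

```python
import re
from pathlib import Path
from typing import Tuple

# A small, illustrative subset of the module renames introduced in Airflow 2.0.
MODULE_RENAMES = {
    "airflow.operators.bash_operator": "airflow.operators.bash",
    "airflow.operators.python_operator": "airflow.operators.python",
    "airflow.operators.dummy_operator": "airflow.operators.dummy",
}

# Match an old module path only when it is not part of a longer identifier.
_PATTERN = re.compile(
    r"(?<![\w.])("
    + "|".join(re.escape(name) for name in MODULE_RENAMES)
    + r")(?!\w)"
)


def rewrite_imports(source: str) -> Tuple[str, int]:
    """Return the rewritten source and the number of replacements made."""
    return _PATTERN.subn(lambda match: MODULE_RENAMES[match.group(1)], source)


def codemod_file(path: Path) -> int:
    """Rewrite one file in place (hypothetical helper); return the change count."""
    updated, count = rewrite_imports(path.read_text())
    if count:
        path.write_text(updated)
    return count
```

A repo-wide run would call `codemod_file` for every `*.py` file under the repository root, for example via `Path(root).rglob("*.py")`, and the per-file counts give a rough measure of how much code each pipeline owner no longer has to fix by hand.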

Embrace the Monorepo

Figure 2 illustrates the original poly-repo architecture of our Airflow-based workflow orchestration platform consisting of dedicated repositories for common libraries, dependency management, container environments, development tools, as well as for each Airflow cluster.

Figure 2. Code repository dependency graph

The original poly-repo architecture suffered from the following issues that slowed down developer productivity and made us prone to incidents:

  • Extra release and version bump process. Changes made in one repository had to be published to PyPI or a container registry before they could be used in another repository. To apply a change from one repository in another, developers had to manually bump the version of the changed repository in the other repository’s requirements file and then update their local environment. This not only slowed down developers but also significantly increased the time it took to ship hot fixes.

  • Onboarding, support and maintenance overhead. Varied structure, setup, and tooling customized to each repository created extra onboarding overhead and issues for Airflow users across Coinbase. This meant extra support and maintenance workload for the Data Platform team.

  • Inadequate test coverage. Tests were usually scoped to a single repository and depended on snapshots of upstream repositories. As the repositories evolved independently, changes made in upstream repositories were often not covered by tests in downstream repositories. As a result, bugs were often found after the fact and cost us extra engineering effort.

To address the above challenges, we adopted the monorepo architecture. Concretely, we built a Python monorepo for the Data Organization to consolidate organically relevant projects into a single location. There are both benefits and challenges to adopting and operating a monorepo, which we discuss in previous blog posts (Part 1 and Part 2).

Tailor the DevX

Airflow users at Coinbase can be roughly classified into the following groups:

  1. Data Engineers: Use Airflow for data extraction and loading.

  2. Software Engineers: Use Airflow for running cron jobs.

  3. Data Scientists: Use Airflow for data transformation.

  4. Data Analysts: Use Airflow for analytics.

  5. Non-technical employees: Use Airflow to execute pre-built jobs.

Different users develop their workflows on top of Airflow at different levels of abstraction and have varying testing, staging and production requirements. Thus, we found that there is no one-size-fits-all solution when it comes to DevX. We decided the best approach to supercharge users’ productivity is to provide a set of development environments users can choose from based on their usage profile. The following is the set of development environments in ascending order of complexity and level of control:

  • CLI-based: We wrap Airflow environment setup, Airflow CLI and a set of internal commands into a single CLI in the form of a prebuilt pex executable.

    • Pros: Hides all the complexities of Airflow and Python environment setup from the user and offers the lowest barrier of entry.

    • Cons: Not a full Airflow installation; not suitable for cross-repository testing.

  • Virtualenv-based: The user sets up a virtualenv on their machine, installs the required dependencies and ensures Airflow is properly configured.

    • Pros: Compatible with our original poly-repo architecture and allows developers to test code that hasn’t been migrated to the Airflow2 monorepo.

    • Cons: Environment setup can fail in machine-specific ways, and because each local setup is unique, the Data Platform team has a hard time reproducing and supporting the issues users hit.

  • Docker-based: We provide a docker-compose environment for developers to run a full-fledged Airflow cluster on their machine. This setup is closest to our production environment and is often used by developers for end-to-end testing.

    • Pros: Makes Airflow user support smoother for the Data Platform team since a Docker-based environment provides a consistent environment across different machines for reproduction of issues; is a full Airflow installation and supports running pipelines end to end.

    • Cons: Requires heavy usage of machine resources and is slow to start; requires non-trivial effort for initial setup and ongoing maintenance.

  • Hosted Playground: We provide shared Airflow clusters in staging environments hosted by the Data Platform team to be used as test playgrounds. Developers can directly deploy and test their work-in-progress pull requests in the playground.

    • Pros: There is no need for local setup since the playground is hosted.

    • Cons: Since this is a shared environment, users are constrained to the Airflow UI for testing.
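
To give a feel for the CLI-based option, the sketch below shows one way a thin wrapper can hide the underlying Airflow command from the user. It is an assumption-laden illustration rather than our internal CLI: the program name `pipelines` and the `test-dag` subcommand are hypothetical, it assumes the `airflow` executable is on the PATH, and environment setup and pex packaging are left out.

```python
import argparse
import subprocess
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build the wrapper's argument parser (all names here are hypothetical)."""
    parser = argparse.ArgumentParser(prog="pipelines")
    subcommands = parser.add_subparsers(dest="command", required=True)

    test_dag = subcommands.add_parser(
        "test-dag", help="Run a single DAG locally for one execution date"
    )
    test_dag.add_argument("dag_id")
    test_dag.add_argument("execution_date", help="for example 2022-11-01")
    return parser


def to_airflow_command(args: argparse.Namespace) -> List[str]:
    """Translate a parsed wrapper command into the Airflow 2 CLI invocation."""
    if args.command == "test-dag":
        return ["airflow", "dags", "test", args.dag_id, args.execution_date]
    raise ValueError(f"unknown command: {args.command}")


def main(argv: Optional[List[str]] = None) -> int:
    """Parse arguments, run the Airflow command, and return its exit code."""
    args = build_parser().parse_args(argv)
    return subprocess.call(to_airflow_command(args))
```

Keeping the translation step in its own function (`to_airflow_command`) means the mapping from user-facing commands to Airflow commands can be unit tested without an Airflow installation.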

Migration via Crowdsourcing

A large number of existing pipelines owned by hundreds of users needed to be migrated to the revamped architecture in an incremental and non-disruptive manner. Migrations of this scale are labor intensive and can face cultural resistance. To tackle these challenges, we took a crowdsourcing-based approach to not only speed up the migration, but also to leverage it as an opportunity for user training. Below are the key steps we took to implement this approach:

  • Identify owners and reach alignment. Pipeline owners were identified via codified ownership information and Git commit history. For pipelines with obscure ownership, we directly communicated with relevant parties to clarify ownership. Technical Program Managers were involved in this process.

  • Automate wherever possible. We built an automated tool to perform the mechanical tasks involved in pipeline migration, including but not limited to copying code to the new monorepo, identifying incompatible code patterns, and updating the metadata. Automation minimizes both human effort and human error.

  • Train users. We knew proactively training our users on the new tooling and development process of the brand-new Python monorepo would be key to increasing participation and minimizing cultural resistance.

  • Track the progress. We created tickets for each pipeline, and tracked the progress of migration in a publicly visible dashboard.

  • Incentivize the migration. Migration work is often deemed as low priority, but an extended period of coexistence of old and new architectures can cause a lot of pain. To encourage our users to prioritize the migration, we built processes to incentivize our users. We organized “migration parties” (social events centered around pipeline migration), sent out official appreciation notes, and created a leaderboard of pipeline migrators ranked by the number of pipelines migrated to gamify the process.
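
As an illustration of the first step, the sketch below ranks likely owners of a pipeline from its Git commit history. It is a simplified stand-in for what we actually did: both function names are hypothetical, it looks only at commit author emails, and it ignores the codified ownership information that we consulted first.

```python
import subprocess
from collections import Counter
from typing import Iterable, List, Tuple


def rank_owner_candidates(
    author_emails: Iterable[str], top_n: int = 3
) -> List[Tuple[str, int]]:
    """Rank commit authors by commit count, most frequent first."""
    counts = Counter(
        email.strip().lower() for email in author_emails if email.strip()
    )
    return counts.most_common(top_n)


def commit_authors(repo_dir: str, pipeline_path: str) -> List[str]:
    """Return the author email of every commit that touched pipeline_path."""
    result = subprocess.run(
        ["git", "log", "--format=%ae", "--", pipeline_path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `rank_owner_candidates(commit_authors(repo, path))` yields a short list of people to contact. Commit counts are only a heuristic, which is why pipelines with obscure ownership still needed direct conversations.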

Future Work

Our Airflow-based workflow orchestration platform is one of the most important pieces of software infrastructure at Coinbase. Hence, it is imperative that we keep improving it to ensure the long-term success of the many engineering teams that rely on the platform to run critical workloads. We’d like to share a few areas for future improvement that the Data Platform team is currently exploring:

  • Pipeline ownership. The simple pipeline owner codification mechanism that comes with vanilla Airflow is often inadequate for us to properly codify pipeline ownership since a pipeline owner at Coinbase usually consists of a number of cross-functional parties: business requirement owner, pipeline developer, on-call team, external integration point of contact, etc. Comprehensive, structured, and up-to-date ownership information is needed to continue to scale Airflow operations.

  • Pipeline observability. Airflow provides basic alerts for task and pipeline failures, but lacks the ability to proactively monitor and detect anomalies in pipelines. Also, pipeline logs and metadata are usually scattered across multiple systems because a pipeline can often trigger external jobs. The lack of a one-stop shop for pipeline observability makes pipeline health diagnosis and triaging issues challenging.

  • Operational lineage. Traditional data lineage solutions focus on lineage between data tables and lack the ability to connect the tables to the pipelines and tasks that generate or consume the data. Common data operational tasks like impact analysis often require tedious manual inspection of multiple pipelines running across different Airflow deployments to connect the dots. The ability to track dependencies across data tables and pipelines across multiple deployments and systems will significantly improve our operational efficiency.
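
To illustrate what operational lineage could enable, the sketch below treats impact analysis as a graph traversal over tasks and the tables they read and write. This is a thought experiment under stated assumptions, not a design we have built: the `TaskLineage` record and `downstream_impact` function are hypothetical, and it assumes each task's input and output tables are already known.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class TaskLineage:
    """Hypothetical record linking one task to the tables it touches."""
    pipeline: str
    task: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def downstream_impact(
    tasks: Iterable[TaskLineage], changed_table: str
) -> Tuple[Set[str], Set[str]]:
    """Return (pipelines, tables) transitively affected by changed_table."""
    # Index tasks by the tables they read, so each lookup is O(1).
    readers: Dict[str, List[TaskLineage]] = {}
    for task in tasks:
        for table in task.reads:
            readers.setdefault(table, []).append(task)

    impacted_pipelines: Set[str] = set()
    impacted_tables: Set[str] = set()
    seen = {changed_table}
    queue = deque([changed_table])

    # Breadth-first walk: table -> tasks that read it -> tables they write.
    while queue:
        table = queue.popleft()
        for task in readers.get(table, []):
            impacted_pipelines.add(task.pipeline)
            for output in task.writes:
                if output not in seen:
                    seen.add(output)
                    impacted_tables.add(output)
                    queue.append(output)
    return impacted_pipelines, impacted_tables
```

With records collected from every Airflow deployment, a single call would replace the manual inspection of multiple pipelines described above.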

--

Acknowledgement

The successful wide adoption of this project would not have been possible without contributions from many people across various teams at Coinbase. We cannot possibly thank everyone but would like to make a special shout-out to the core project team: Jianlong Zhong, Sungho Yoo, Mingshi Wang, Brandon Lawrence and Yisheng Liang. We would also like to thank Eric Sun for supporting the project team, as well as Michael Li and Leo Liang for sponsoring the project internally.
