TL;DR: This blog post explains how we developed a continuous integration (CI) and continuous delivery (CD) system for Databricks to improve the efficiency of engineering development.
As Coinbase onboarded more applications onto the Databricks platform, we recognized the need for a managed approach to building and releasing applications reliably and efficiently. A robust CI/CD platform is an important component of the Databricks development experience, as it significantly enhances engineering productivity. To achieve this, we developed a CI/CD system that seamlessly carries source code through to production. Our objective is to empower users to release their compute tasks in a simple, declarative manner, without having to deal with the complexities of the underlying system.
User experience
The development experience for the CI/CD pipeline involves a combination of user-owned steps and an automated workflow that handles the remaining deployment tasks. To illustrate the steps involved, let's take the example of a PySpark application with a driver file “pyspark_app.py”. To get started, we need to configure two YAML files: one for development submission and another for production submission. These files should contain all the necessary metadata for each environment.
### This is the example job configuration YAML for the development
### environment. Similarly, the production configuration YAML is set
### up with the relevant parameters for running the job in the
### production environment.
kind: Job
metadata:
  name: pyspark_app
  target: development
  deployVersion: 0
spec:
  mainApplicationFile: data_processing/src/python/my_pyspark_app/pyspark_app.py
After the configuration files have been set up, users can declare the compute task using customized Pants BUILD targets. The BUILD targets inform the CI platform to build the PySpark application into an archived job folder which can then be consumed by the CD platform for execution.
### data_processing/src/python/my_pyspark_app/BUILD
released_files(
    name="pyspark_app",
    sources=["pyspark_app.py"],
)
compute_task(
    name="pyspark_app",
    sources=["development_config.yml", "production_config.yml"],
)
The CD platform provides two options for initiating the execution of a Databricks job. The first option is auto-deployment: merge a pull request that increments the "deployVersion" value in the job configuration YAML, then wait for the CD system to automatically schedule the deployment. The second option is manual deployment, which lets users select a build and submit a job with that build via either the web console or the command-line tool.
Through this process, we can ensure that our PySpark application is built and deployed in a consistent and automated manner, minimizing the risk of errors and increasing the overall efficiency of the deployment process.
Architecture of CI/CD platform
The platform comprises several key components that play integral roles in its functionality:
The serving layer is responsible for managing the lifecycle of Databricks jobs through gRPC APIs.
Temporal orchestrated workflows execute CI/CD processes, monitor Databricks job status, and handle cleanup of stale jobs. These workflows automate the entire cycle of building, testing, deploying, and monitoring Databricks jobs.
The Databricks job proxy is responsible for translating high-level compute tasks into low-level Databricks job management requests (a sketch of this translation follows this list).
Observability layers provide valuable insights into the platform's performance and facilitate timely issue identification and resolution. This layer includes a dashboard for monitoring system metrics, Slack channels for real-time notifications of job status and significant events, and PagerDuty for alerting and incident management.
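To make the job proxy's translation step concrete, here is a minimal sketch, assuming the proxy receives an in-memory form of the job configuration YAML and targets the standard Databricks Jobs API. The ComputeTask dataclass, its field names, the submit_compute_task helper, and the cluster settings are illustrative assumptions, not the actual Coinbase implementation.

# Illustrative sketch only: translate a high-level compute task (the YAML
# shown earlier) into a low-level Databricks Jobs API run submission.
from dataclasses import dataclass

import requests  # the real proxy may use internal gRPC clients instead


@dataclass
class ComputeTask:  # hypothetical in-memory form of the job configuration YAML
    name: str
    target: str                 # "development" or "production"
    main_application_file: str  # e.g. path to pyspark_app.py inside the artifact
    deploy_version: int


def submit_compute_task(task: ComputeTask, host: str, token: str) -> int:
    """Translate a compute task into a Databricks one-time run submission."""
    payload = {
        "run_name": f"{task.name}-v{task.deploy_version}-{task.target}",
        "tasks": [
            {
                "task_key": task.name,
                "spark_python_task": {"python_file": task.main_application_file},
                "new_cluster": {  # cluster sizing would come from the job YAML
                    "spark_version": "13.3.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": 2,
                },
            }
        ],
    }
    resp = requests.post(
        f"{host}/api/2.1/jobs/runs/submit",
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]  # recorded in the job metadata database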
Continuous integration (CI)
The CI workflow is responsible for building and uploading code artifacts that are used in the CD pipeline. This workflow continuously builds the GitHub repository, generating binary files such as Python wheels, JAR files, and .tar files. It also compresses the packages of job configuration YAMLs and creates a metadata file that stores the GitHub commit SHA for the build. These artifacts are then uploaded to a persistent store for use in the CD pipeline.
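As a rough illustration of this packaging step, the sketch below archives the job configuration YAMLs together with a metadata file recording the Git commit SHA, then uploads the bundle to a persistent store. The helper names, file layout, and the choice of S3 as the store are assumptions for illustration only.

# Hedged sketch of the CI packaging step: bundle job configuration YAMLs
# with a metadata file that records the GitHub commit SHA, then upload the
# archive to a persistent store.
import json
import subprocess
import tarfile
from pathlib import Path

import boto3  # assumes the persistent store is S3; any blob store would do


def package_job_configs(config_dir: Path, out_dir: Path) -> Path:
    """Compress job configuration YAMLs and attach build metadata."""
    commit_sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()

    metadata_path = out_dir / "build_metadata.json"
    metadata_path.write_text(json.dumps({"commit_sha": commit_sha}))

    archive_path = out_dir / f"job_configs_{commit_sha}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        for yaml_file in config_dir.rglob("*_config.yml"):
            tar.add(yaml_file, arcname=yaml_file.name)
        tar.add(metadata_path, arcname="build_metadata.json")
    return archive_path


def upload_artifact(archive_path: Path, bucket: str) -> None:
    """Upload the build artifact so the CD pipeline can consume it later."""
    boto3.client("s3").upload_file(str(archive_path), bucket, archive_path.name)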
Continuous delivery (CD)
The CD workflow deploys newly built code artifacts to the Databricks environment. It comprises a parent workflow that periodically fetches compute tasks from code artifacts and schedules submissions, and a child workflow that submits jobs to the Databricks environment. The job submission status is recorded in the Databricks job metadata database, allowing the CD pipeline to track job status and handle any failures or errors that may occur.
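The parent/child structure maps naturally onto Temporal workflows. The sketch below uses the open-source Temporal Python SDK only to show the shape of that orchestration; the workflow and activity names, polling interval, and stubbed bodies are assumptions rather than the actual Coinbase implementation.

# Hedged sketch of the parent/child deployment workflows with the Temporal
# Python SDK (temporalio). Names, interval, and stub bodies are illustrative.
import asyncio
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def fetch_compute_tasks() -> list[str]:
    """Read compute task names from the latest code artifacts (stubbed)."""
    return ["pyspark_app"]


@activity.defn
async def submit_job(task_name: str) -> None:
    """Submit the job to Databricks via the job proxy (stubbed)."""
    activity.logger.info("submitting %s", task_name)


@workflow.defn
class JobSubmissionWorkflow:
    """Child workflow: submits one job and records its status."""

    @workflow.run
    async def run(self, task_name: str) -> None:
        await workflow.execute_activity(
            submit_job, task_name, start_to_close_timeout=timedelta(minutes=10)
        )


@workflow.defn
class DeploymentSchedulerWorkflow:
    """Parent workflow: periodically fetches compute tasks and fans out children."""

    @workflow.run
    async def run(self) -> None:
        # A production workflow would bound its history with continue-as-new.
        while True:
            tasks = await workflow.execute_activity(
                fetch_compute_tasks, start_to_close_timeout=timedelta(minutes=5)
            )
            for task_name in tasks:
                await workflow.execute_child_workflow(
                    JobSubmissionWorkflow.run, task_name, id=f"submit-{task_name}"
                )
            await asyncio.sleep(600)  # wait 10 minutes before the next poll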
The CD platform offers two job submission strategies that cater to different use cases:
Simple strategy: This strategy is ideal for one-time batch processing jobs. When a new version of the job is submitted, the platform stops the existing job if it is still running and starts the new version. This ensures that only one version of the job is running at any given time and helps to avoid conflicts and errors that can arise from running multiple versions simultaneously.
Blue-green strategy: This strategy is designed for streaming jobs that require continuous data processing without any gaps. It demands careful handling, as it is crucial for guaranteeing the data quality of streaming applications.
It's worth noting that the blue-green strategy is critical for the majority of Coinbase's Databricks use cases, where streaming applications play a central role in processing and analyzing large volumes of data in real time.
With the blue-green submission strategy, the CD platform starts a job of the new version (green) at time "t0" without killing the old version (blue). The platform waits until the new version is up and running at "t1"; the two versions then run side by side. To ensure that the new version is stable and produces accurate results, the CD platform waits an additional period for the new version to catch up with the old version before killing the old version at time "t2". This additional time is determined by the "healthSeconds" parameter in the configuration YAML and is a critical component of the blue-green strategy. If the new version experiences issues during the "healthSeconds" period, the existing version remains unaffected and continues to function as expected.
Implementing the blue-green strategy requires a mechanism for ensuring that the new version of the job catches up with the old one before fully replacing it. In our design, we use a distributed lock service and a streaming checkpoint table to coordinate the two versions of the job. At any given time, only one job can hold the lock and persist outputs to the database. When a new version of the job is submitted, it fetches the latest streaming checkpoint, which could be a Kafka offset or a blockchain height. The new version then starts processing the stream side by side with the old version. As the new version catches up with the old one, it requests the lock. The lock service confirms that the new version is up to date and grants it the lock. As shown below, the old version loses the lock and exits.
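A minimal sketch of that hand-off, assuming a generic lock service client and checkpoint table (both stand-ins for the actual internal services), could look like the following.

# Hedged sketch of the blue-green hand-off described above. The lock service
# client, checkpoint table, and stream-processing helpers are illustrative
# assumptions, not the actual Coinbase implementation.


def process_next_batch(checkpoint):
    """Application-specific stream processing (assumed); returns (batch, new checkpoint)."""
    raise NotImplementedError


def write_outputs(batch, checkpoint, checkpoint_table):
    """Persist outputs and commit the checkpoint (assumed)."""
    raise NotImplementedError


def run_streaming_job(job_version: str, lock_service, checkpoint_table) -> None:
    # Both versions start from the latest committed streaming checkpoint,
    # e.g. a Kafka offset or a blockchain height.
    checkpoint = checkpoint_table.read_latest()
    held_lock_before = False

    while True:
        batch, checkpoint = process_next_batch(checkpoint)

        if lock_service.holds_lock(job_version):
            # Only the lock holder may persist outputs, so at most one
            # version writes to the database at any given time.
            held_lock_before = True
            write_outputs(batch, checkpoint, checkpoint_table)
        elif held_lock_before:
            # The lock moved to the newer version: the old (blue) job exits.
            break
        elif checkpoint >= checkpoint_table.read_latest():
            # The new (green) version has caught up with the committed
            # checkpoint, so it requests the lock; the lock service grants
            # it only once it agrees the job is up to date.
            lock_service.request_lock(job_version, checkpoint)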
Monitoring
The monitoring workflow is responsible for reporting the health of the entire CI/CD platform as well as the job clusters. The workflow periodically polls the status of all submitted jobs across Databricks workspaces and updates the job metadata DB to be consistent with the actual state of the jobs. For example, the workflow detects job runs started by Databricks due to a job schedule or a failure retry and records the latest run ID in the job metadata table.
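A single monitoring poll against the standard Databricks Jobs API might look like the sketch below; the metadata store interface and its record_latest_run helper are assumptions for illustration.

# Hedged sketch of one monitoring poll: fetch the most recent run of a job
# from the Databricks Jobs API and record it in the job metadata store.
import requests


def poll_latest_run(host: str, token: str, job_id: int) -> dict:
    """Return the most recent run of a Databricks job, including scheduled
    runs and failure retries started by Databricks itself."""
    resp = requests.get(
        f"{host}/api/2.1/jobs/runs/list",
        headers={"Authorization": f"Bearer {token}"},
        params={"job_id": job_id, "limit": 1},
        timeout=30,
    )
    resp.raise_for_status()
    runs = resp.json().get("runs", [])
    return runs[0] if runs else {}


def sync_job_metadata(metadata_db, host: str, token: str, job_id: int) -> None:
    """Keep the job metadata DB consistent with the actual state in Databricks."""
    latest_run = poll_latest_run(host, token, job_id)
    if latest_run:
        metadata_db.record_latest_run(  # assumed metadata store interface
            job_id=job_id,
            run_id=latest_run["run_id"],
            state=latest_run["state"]["life_cycle_state"],
        )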
The workflow also collects metrics that reflect the health of the platform. These metrics include the number of successful and failed CI builds, the latest build version (Git commit SHA) of the code artifacts, Databricks job submission reliability and statistics, and the health status of individual jobs. The monitoring workflow uses Slack channels to keep users informed about the progress and results of job submissions. In case of failures, it promptly alerts the job owners, enabling them to take immediate corrective action.
Stale job cleanup
The architecture graph does not include the workflow for stale job cleanup. This workflow removes jobs that have not had any active runs in the past three months — which helps to ensure that the total number of jobs remains within the quota set by Databricks.
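Conceptually, the cleanup lists jobs, checks when each last ran, and deletes those idle beyond the retention window. The sketch below illustrates this against the standard Databricks Jobs API; the exact cutoff, pagination, and error handling are simplified assumptions.

# Hedged sketch of stale job cleanup using the standard Databricks Jobs API.
import time

import requests

NINETY_DAYS_MS = 90 * 24 * 60 * 60 * 1000  # assumed three-month retention window


def delete_stale_jobs(host: str, token: str) -> None:
    headers = {"Authorization": f"Bearer {token}"}
    cutoff_ms = int(time.time() * 1000) - NINETY_DAYS_MS

    jobs = requests.get(
        f"{host}/api/2.1/jobs/list", headers=headers, timeout=30
    ).json().get("jobs", [])

    for job in jobs:
        runs = requests.get(
            f"{host}/api/2.1/jobs/runs/list",
            headers=headers,
            params={"job_id": job["job_id"], "limit": 1},
            timeout=30,
        ).json().get("runs", [])

        # Delete the job if it has never run, or if its latest run started
        # before the cutoff, keeping the job count within the Databricks quota.
        if not runs or runs[0]["start_time"] < cutoff_ms:
            requests.post(
                f"{host}/api/2.1/jobs/delete",
                headers=headers,
                json={"job_id": job["job_id"]},
                timeout=30,
            ).raise_for_status()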
Acknowledgements
We would like to express our gratitude to Sumanth Sukumar, Leo Liang, and Eric Sun for sponsoring this project. We would also like to thank Yisheng Liang, Chen Guo, Allen He, Henry Yang, and many others for their contributions. As we rolled out the platform, we received a lot of feedback that helped us improve the system. We appreciate the support and collaboration of everyone involved in this project.
About the authors: Mingshi Wang, Staff Software Engineer, and Ben Xu, Senior Engineering Manager.