Coinbase Logo

Databricks cost management at Coinbase

Tl;dr: This blog outlines the cost management strategy implemented at Coinbase for Databricks applications, including the launch of a cost insights platform and quota enforcement platform.

By Mingshi Wang, Ben Xu

Engineering

, May 24, 2023

Coinbase Blog

Databricks is a critical platform for the success of many products at Coinbase. However, it is crucial to exercise caution and prevent excessive expenses to remain within the company's allocated budget. This requires investigating the source of expenses, optimizing existing jobs, and implementing quota enforcement for effective expense management. To address these needs, we have developed a comprehensive cost management platform that includes systems for cost attribution, cost insight pipeline, and quota enforcement.

For cost attribution, we have enforced cluster tagging across all Databricks jobs. This enables us to identify resource usage by teams, providing better insights into which teams are utilizing specific resources. To gain insights into cost analysis, we have leveraged Databricks Overwatch, a powerful tool that extracts valuable information from logging data. This helps us determine the sources of costs and allows us to propose effective strategies for cost reduction to stakeholders. In order to proactively prevent overspending, we have implemented a centralized quota enforcement mechanism. This ensures control over cluster resource usage in the future, helping us manage expenses more efficiently.

chart1

We utilized Databricks Overwatch as our primary tool to extract cost insights from Databricks logs. The cost insights platform relies on the Databricks log delivery, which provides critical information about DBU usages, cluster performance, and job metrics. We implemented a Databricks pipeline based on the Overwatch library in order to process the logs into curated cost insight data. Additionally, we developed an analytics pipeline that interprets the Overwatch data and renders the result to the presentation layer on a daily basis. This enables us to closely monitor the cost trends and make informed decisions regarding cost controls.

Data models

We collect several types of logs as inputs for the cost insights platform, including:

  1. Billable usage logs: These logs provide monthly reports of usage by DBU or dollar amount and can be grouped by workspace, SKU, or custom cluster tag.

  2. Audit logs: These logs contain all Databricks job activities and DBU information.

  3. Cluster logs: These logs contain cluster configurations, usage metrics, and job activities.

  4. Spark events: These logs are generated by Apache Spark and contain job, execution, and task metrics that are useful for performance profiling.

In addition to the logs we collect, the cost insights pipeline integrates with Databricks and AWS cloud APIs to enrich its calculation.

Curated cost insight data model

Databricks Overwatch processes logging data through multiple stages (from bronze to silver to gold to presentation) with each layer building upon the previous layer. To enrich the data, Overwatch also integrates with the Databricks and AWS cloud APIs.

We analyze the growth and utilization of the platform by examining the number of active jobs over time. Resource allocation patterns are understood by tracking the trend of cluster sizes, including nodes, vCores, and memory. We monitor cluster costs, both overall and per job, considering DBUs and compute costs. Job quality is evaluated using Overwatch-recommended metrics, identifying optimization opportunities. Spark profiling metrics aid in performance tuning, distinguishing CPU-bound jobs and optimizing node types. Shuffle read/write metrics guide the selection of appropriate node types to reduce network I/O costs. Finally, we analyze notebook metrics, specifically identifying notebook jobs running on all-purpose clusters (discouraged due to higher costs) and pinpointing the most costly notebooks. By analyzing these metrics, we can optimize the performance and cost of Spark jobs.

Below is a graph that shows a section of the cost insights dashboard.

chart2

Quota enforcement

chart3

Databricks instance pools

To manage resource usage for our job clusters, we created a set of predefined Databricks instance pools to support our jobs. An instance pool is a collection of ready-to-use cloud instances that are idle and can help reduce the start-up and auto-scaling times for job clusters. By appropriately configuring and tagging our instance pools, we can easily control costs while ensuring efficient resource allocation.

To meet various data processing requirements, we created the following categories of pools:

  1. General purpose instance pool: This pool is available to all users and serves as the default option.

  2. Compute optimized instance pool: This pool is accessible to approved users or LDAP groups and is designed for CPU-intensive jobs that require significant compute power.

  3. Memory optimized instance pool: This pool is accessible to approved users or LDAP groups and is intended for memory-intensive jobs that require high memory capacity.

  4. Development instance pool: This pool is open to all users and is ideal for onboarding new applications. It has fewer constraints on resource usage, but expires after 30 days.

  5. Special purpose instance pool: This pool is accessible to approved users or LDAP groups and is tailored to applications with special requirements, such as those that require GPU.

To meet our optimization goals, each pool is configured with specific attributes. We minimize costs by setting the minimum idle instances to 0, avoiding running idle instances. However, this may result in longer cold start times. To further reduce costs, we set the idle instance auto termination to the default value of 2 minutes. Based on Overwatch data analysis, we observed that the time between successive submissions is typically over 10 minutes, making longer idle times unnecessary. The maximum capacity of each pool is determined by analyzing Overwatch data and aligning it with resource requirements.

Cluster policies

We have implemented predefined Databricks cluster policies to manage the maximum cost of job clusters created by users. Some key attributes include Instance_pool_id for selecting predefined instance pools, num_workers and autoscale.min_workers for worker configuration, and autoscale.max_worker for dynamic scaling. The node_type_id and driver_node_type_id attributes specify supported instances based on performance tuning. We track job ownership using custom_tags. The dbus_per_hour attribute sets limits on maximum DBUs per hour. The cluster_type is set to "job" for cost efficiency, and spark_version ensures compatibility. These policies effectively manage cluster cost and behavior while meeting user requirements.

Acknowledgements

We would like to express our gratitude to Sumanth Sukumar, Leo Liang, and Eric Sun for their support in sponsoring this project. We also extend our appreciation to our data platform team members, particularly Hans Wu, Yisheng Liang, and Chen Guo for their valuable contributions to the project. Their dedication and expertise were instrumental in the successful implementation of the cost insights platform.

Coinbase logo
Mingshi Wang
Ben Xu

About Mingshi Wang and Ben Xu

Staff Software Engineer

Senior Engineering Manager