Part 1: ChainStorage: The Enterprise Blockchain data availability layer

Tl;dr: This is the first article in a three-part series exploring ChainStack, Coinbase's enterprise-grade blockchain data platform. In this series, we'll dive into the architecture, technology, and processes that make ChainStack a powerful foundation for data availability, computation, and indexing. In this post, we dissect ChainStack's components and share our learnings from the vantage point of software engineering.

By Jie Zhang, Leo Liang, Jian Wang, Ming Jiang

Engineering, November 21, 2023

Coinbase Blog

ChainStorage is the data availability layer which makes data available, trustworthy and easy to access.

Early in 2021, the surging crypto bull market heightened the need for enhanced blockchain data processing capabilities, including real-time analytics, efficient NFT indexing, and machine learning on crypto data.

At Coinbase, various teams crafted tailored solutions for blockchain data indexing to meet particular use cases. Typically, this involved extracting necessary data from archival nodes and processing it for different products. However, this strategy faced multiple challenges:

  1. Maintaining scalable archival nodes, which store the entire state of the blockchain, is costly.

  2. Data extraction from these nodes was often slow, unreliable and expensive.

  3. The repetitive task of data ingestion and the inefficient use of computing resources mirrored challenges seen in big data processing in previous years, leading to significant operational overhead.

Seeing these issues, we identified a need to shift our architectural approach to blockchain data indexing. Our vision was bold:

  1. Make crypto data indexing a more enjoyable job, achieve near real-time performance, improve efficiency by an order of magnitude, and reduce operational costs.

  2. Develop a robust blockchain data availability layer to support enterprise-level crypto applications.

  3. Share our findings and experience with the broader crypto engineering community, contributing to collective improvements in the field.

Introducing ChainStack

Acknowledging these challenges, the Coinbase Crypto Data Foundation (CDF) team has been developing a crypto-native data platform, codenamed ChainStack, since 2021.

ChainStack’s design draws inspiration from big data practices, integrating principles like compute-storage separation and change data capture. Utilizing modern data stack technologies such as Apache Spark and Arrow, it ensures efficient data access with minimal serialization overhead.

To structure ChainStack, we’ve established three distinct systems, each serving a specific purpose:

  1. ChainStorage: Focused on the secure and efficient storage of blockchain data, acting like a “read replica” of the blockchain.

  2. Chainsformer: Facilitates data access through state-of-the-art big data standards.

  3. ChaIndex: Builds low-latency, scalable indexes on top of ChainStorage.

This layered architecture ensures ChainStack not only provides a straightforward yet powerful data and compute model, but also remains versatile enough to adapt to the diverse and ever-evolving blockchain ecosystems.

For context, here are some use cases of blockchain data at Coinbase. They offer insight into the challenges enterprises face when integrating crypto into their business flows.

image1

ChainStorage Overview

In a scalable system, robust storage is the foundation. ChainStorage is a blockchain file system that stores blocks as files, acting like a continuously-updated read replica of the blockchain.

image2

The core design principles of ChainStorage focus on:

  1. Data Completeness: We capture all raw blockchain data in a single pass to eliminate the need for multiple queries.

  2. Modularity and Flexibility: We separate data ingestion from processing to preserve raw data, leaving more complex and error-prone tasks like parsing to later stages.

  3. Performance and Trust: ChainStorage leverages a distributed storage system for high availability and scalability, maintaining verifiable trust and facilitating easy integration and a path towards decentralization.

At its core, ChainStorage replicates the source blockchain state into scalable storage, broadening access to blockchain data. Each block is stored as a file object, a primitive yet effective abstraction. The file can live in any blob storage, with key-value storage acting as the meta storage, much like a file system. The file object abstraction gives developers an intuitive interface and a simple structure for parallel computation.
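To make the file-object idea concrete, here is a minimal sketch in Python. It is not ChainStorage's actual code: the class names, the object key layout, and the in-memory dictionaries (standing in for blob storage and the key-value meta store) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockMeta:
    """Hypothetical metadata record kept in the key-value store."""
    height: int
    block_hash: str
    object_key: str  # where the raw block file lives in blob storage


class InMemoryBlockStore:
    """Toy stand-in: `blobs` plays the blob store, `meta` the key-value store."""

    def __init__(self):
        self.blobs = {}  # object_key -> raw block bytes
        self.meta = {}   # height -> BlockMeta

    def put_block(self, height: int, block_hash: str, raw: bytes) -> None:
        # Assumed key layout; one file object per block.
        object_key = f"blocks/{height}/{block_hash}"
        self.blobs[object_key] = raw
        self.meta[height] = BlockMeta(height, block_hash, object_key)

    def get_block(self, height: int) -> bytes:
        # Look up the metadata first, then fetch the file it points to.
        return self.blobs[self.meta[height].object_key]


store = InMemoryBlockStore()
store.put_block(100, "0xabc", b"raw-block-100")
print(store.get_block(100))
```

Because each block is an independent file, readers can fetch many blocks at once without coordinating with each other, which is what makes the structure friendly to parallel computation.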

ChainStorage adopts a unique data ingestion approach, shifting from the conventional ETL (Extract, Transform, Load) method to an ELT (Extract, Load, Transform) paradigm. Here's the breakdown:

  • Extract: We pull raw data concurrently from a cluster of load-balanced nodes, with a consensus mechanism managing chain reorganizations.

  • Load: Raw block data is stored in blob storage (like S3), with metadata housed in key-value storage (KV), for instance, DynamoDB. The key-value schema is crafted to allow for parallel ingestion while maintaining a strict order for data retrieval.

  • Transform: Although chain-native and chain-agnostic parsers are part of the package, transformation is postponed till the data ingestion phase is done.
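The three ELT steps above can be sketched as follows. This is an illustration under stated assumptions, not the real pipeline: `fetch_raw_block` is a hypothetical extractor that fabricates a JSON payload instead of calling a node, and plain dictionaries stand in for S3 and DynamoDB. The point it shows is that blocks may be written in any order by parallel workers, while reads still come back in strict height order and parsing happens only afterwards.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def fetch_raw_block(height: int) -> bytes:
    """Hypothetical extractor; a real one would call a blockchain node."""
    return json.dumps({"number": height, "txs": [f"tx-{height}-0"]}).encode()


def ingest(heights, blob_store: dict, meta_store: dict, workers: int = 4) -> None:
    """Extract + Load: workers store raw blocks concurrently, in any order."""
    def extract_and_load(height: int) -> None:
        key = f"raw/{height}"                # assumed object key layout
        blob_store[key] = fetch_raw_block(height)
        meta_store[height] = key             # metadata is keyed by height

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(extract_and_load, heights))  # wait for all writes


def read_in_order(blob_store: dict, meta_store: dict):
    """Retrieval: iterate metadata by height to restore strict ordering."""
    for height in sorted(meta_store):
        yield height, blob_store[meta_store[height]]


def transform(raw: bytes) -> dict:
    """Transform: parsing is deferred until after the raw data is stored."""
    block = json.loads(raw)
    return {"height": block["number"], "tx_count": len(block["txs"])}


blobs, meta = {}, {}
ingest([3, 1, 2, 0], blobs, meta)
parsed = [transform(raw) for _, raw in read_in_order(blobs, meta)]
print(parsed)
```

Keeping the raw bytes untouched means a parser bug can be fixed and the transform re-run without going back to the nodes.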

image3

Access to ChainStorage data is a breeze with two pathways:

  • Batch APIs: These are tailored for reading blocks in a horizontally scalable fashion. A conscious trade-off has been made to focus on block-level APIs exclusively, optimizing data schema and locality to achieve read latency that rivals existing indexers built atop relational databases.

  • Streaming APIs: These allow downstream systems to stay in sync with the blockchain state, also providing insights into chain reorganization events.
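As a rough illustration of why block-level batch reads scale horizontally, the sketch below splits a height range into contiguous batches and reads them in parallel. The functions `split_range`, `read_batch`, and `read_blocks` are hypothetical names, not the ChainStorage SDK, and `read_batch` returns placeholder records rather than calling a real service.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start: int, end: int, batch_size: int):
    """Split [start, end) into contiguous [lo, hi) batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [(lo, min(lo + batch_size, end)) for lo in range(start, end, batch_size)]


def read_batch(lo: int, hi: int):
    """Hypothetical block-level batch read; returns placeholder blocks."""
    return [{"height": h} for h in range(lo, hi)]


def read_blocks(start: int, end: int, batch_size: int = 100, workers: int = 8):
    """Read batches concurrently; pool.map preserves batch order."""
    batches = split_range(start, end, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda b: read_batch(*b), batches))
    return [block for batch in results for block in batch]


blocks = read_blocks(0, 250)
print(len(blocks), blocks[0], blocks[-1])
```

Since batches do not depend on one another, adding workers (or machines) increases read throughput without changing the calling code.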

The data model of ChainStorage is simple and generic, requiring only a monotonically increasing sequence of blocks. It stays agnostic to block content because the interpretation layer for each class of blockchain is supplied as a plug-in. So far we have integrated a wide range of EVM chains (such as Ethereum, Base, Polygon, Avalanche, and Arbitrum), as well as Bitcoin, Solana, and Aptos, paving the way for a more connected and accessible multi-blockchain universe.

ChainStorage is blockchain feature aware

Blockchains are valued for their immutability, but the observed state of the chain can be inconsistent due to blockchain reorganizations (reorgs). This concept is detailed in the following illustration.

image4

ChainStorage handles blockchain fluidity by not overwriting blocks when changes occur, but by logging changes as a strictly ordered series of events marked by additions (+) or removals (-). Each event is tagged with a monotonically-increasing sequential number, providing a solid base for change-data-capture mechanisms later on.

To clarify, the canonical chain can be reassembled by grouping these events by height and selecting the event with the highest sequence number in each group. In the example above, the block stream can be mirrored into a key-value store such as DynamoDB, employing the time-based versioning pattern.
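The reassembly rule can be sketched in a few lines of Python. The `BlockEvent` type and its field names are assumptions for illustration, and so is one detail the text leaves open: when the latest event at a height is a removal, this sketch treats that height as having no canonical block yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockEvent:
    sequence: int    # monotonically increasing event number
    kind: str        # "+" for block added, "-" for block removed
    height: int
    block_hash: str


def canonical_chain(events):
    """Group events by height and keep the one with the highest sequence."""
    latest = {}
    for event in events:
        current = latest.get(event.height)
        if current is None or event.sequence > current.sequence:
            latest[event.height] = event
    # Assumption: a trailing removal means the height is currently empty.
    return {h: e.block_hash for h, e in sorted(latest.items()) if e.kind == "+"}


# A two-block reorg: blocks b and c are removed, then replaced by b2 and c2.
events = [
    BlockEvent(1, "+", 100, "a"),
    BlockEvent(2, "+", 101, "b"),
    BlockEvent(3, "+", 102, "c"),
    BlockEvent(4, "-", 102, "c"),
    BlockEvent(5, "-", 101, "b"),
    BlockEvent(6, "+", 101, "b2"),
    BlockEvent(7, "+", 102, "c2"),
]
print(canonical_chain(events))  # {100: 'a', 101: 'b2', 102: 'c2'}
```

Nothing is overwritten in the event log itself, so a consumer that stopped at sequence 3 can resume from sequence 4 and replay the reorg exactly as it happened.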

This approach maintains data integrity and adapts easily to blockchain changes. ChainStorage, crypto-aware by design, offers a solid and forward-looking solution in crypto data management.

image5

ChainStorage is designed for enterprise application

The reliability and accuracy of ChainStorage data are crucial, because it supports essential applications. ChainStorage is built for enterprise-level dependability, with features aimed at the complex needs of high-stakes enterprise applications:

  • Data Trust: To ensure data trustworthiness, ChainStorage has integrated validation techniques similar to blockchain clients, such as Merkle hash validation, during data ingestion. It also enables data verification through its SDK for consumers, providing the same cryptographic assurances as a local archive node, ideal for vital processes like payment settlements.

  • Node Availability: To tackle the issue of node availability, critical in enterprise settings due to the difficulties in maintaining archive nodes, ChainStorage has a node failover system. It ensures smooth transitions between node sources without service lapses or compromised security, bolstering system resilience.
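To give a feel for what Merkle hash validation involves, here is a simplified sketch. It is not ChainStorage's validator: it uses a Bitcoin-style binary Merkle tree over double SHA-256 (Ethereum, by contrast, uses Merkle Patricia tries with Keccak-256), it ignores Bitcoin's byte-order conventions, and the function names are hypothetical. The idea is the same in each case: recompute the root from the block's contents and compare it with the root committed in the block header.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as Bitcoin does for transaction hashes."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def merkle_root(leaf_hashes):
    """Fold leaf hashes pairwise until a single root remains."""
    if not leaf_hashes:
        raise ValueError("at least one leaf hash is required")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [
            double_sha256(level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


def verify_block(header_merkle_root: bytes, tx_hashes) -> bool:
    """Accept the block only if its contents reproduce the header's root."""
    return merkle_root(tx_hashes) == header_merkle_root


tx_hashes = [double_sha256(tx) for tx in (b"tx0", b"tx1", b"tx2")]
root = merkle_root(tx_hashes)
print(verify_block(root, tx_hashes))        # True
print(verify_block(root, tx_hashes[::-1]))  # False
```

Running a check like this both at ingestion time and on the consumer side means a reader does not have to trust the storage layer blindly.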

Performance, metrics, and limitation

For a concrete example, consider a production ChainStorage instance for Ethereum deployed on AWS. The numbers below reflect this specific setup:

  • Data Freshness: 1 to 2s

image6
  • Throughput: In theory, throughput (blocks/second) can be scaled without limit. In this experiment, when we scaled to ~1000 blocks/second, data freshness dropped to ~1s. In other words, you can rebuild the database of Etherscan within one hour or less. Note that each block in ChainStorage contains all the artifacts, including transactions, logs, and traces. See here for the Protobuf definition.

image7
  • Cost: With our implementation, the operating cost to support all use cases across Coinbase is just a fraction of the conventional cost of retrieving data from node providers.

ChainStorage represents a significant stride in the realm of enterprise-grade blockchain data management. This platform is thoughtfully crafted to ensure data integrity and enhance access speed, directly addressing the complex needs of Coinbase's crypto data ecosystem. 

With its release into the open-source community, we invite collaborators to contribute to ChainStorage's evolution on GitHub. Your feedback and contributions will help us advance this platform, ensuring robust and efficient data availability for the blockchain networks that matter to you.


About Jie Zhang, Leo Liang, Jian Wang and Ming Jiang

Senior Staff Software Engineer

Director of Engineering

Senior Engineering Manager

Senior Product Manager, Data