Part 2: Chainsformer: The big data adapter of ChainStorage

TL;DR: The second installment in a three-part series delves into Coinbase's enterprise-grade blockchain data platform. The focus is on Chainsformer, exploring its architecture and processes that streamline both streaming and batch processing of blockchain data at scale. The article highlights how Chainsformer addresses challenges and ensures native compatibility with major big data ecosystems, facilitating the integration of blockchain data for optimal freshness and high throughput scenarios.

By David Lai, Leo Liang, Jie Zhang, Jian Wang

Engineering, December 11, 2023

Coinbase Blog

Chainsformer is the data transformation layer that enables streaming and batch processing of blockchain data from ChainStorage at scale.

However, as we have grown beyond a few use cases, we have identified certain challenges associated with ChainStorage's API-centric approach:

  1. Complexity & Manageability: Realizing the full capabilities of ChainStorage requires a deep understanding of parallel computing. Managing concurrent processes can be daunting and can lead to intricate code structures. For example, a seemingly straightforward task like data backfilling may require developers to design intricate workflows that coordinate hundreds of simultaneous workers for multi-tiered data transformations. This can be a complex challenge both in terms of coding and management.

  2. Efficiency & Scalability: Despite ChainStorage offering high-throughput APIs, its dependence on JSON as the primary data format presents certain limitations. While JSON is suitable for lightweight data exchanges, it becomes less efficient for handling extensive data transfers due to the frequent requirement for Serialization/Deserialization (SerDe) when data crosses application boundaries. Even Protobuf, which is designed for optimized serialization, lacks the storage and performance enhancements needed for applications that heavily rely on analytical data reading. Furthermore, as workflows grow in complexity, the need for more streamlined workflow management intensifies.

  3. Composability & Reusability: Blockchain data is multifaceted. From blocks and transactions to inner transactions, each layer unveils essential data primitives and requires specific parsing. These foundational elements are pivotal for an assortment of workflows, requiring data to be stored in a unified format, allowing for easy integration and reuse of data across different parts of the workflow.
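To make the first challenge concrete, here is a minimal sketch of the coordination a developer might end up writing by hand for a backfill against a block-by-block API. The `fetch_block` function is a hypothetical stand-in rather than a ChainStorage API, and the chunk size, worker count, and retry policy are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_block(height: int) -> dict:
    """Hypothetical stand-in for a per-block API call (not a real ChainStorage API)."""
    return {"height": height, "tx_count": height % 7}


def split_range(start: int, end: int, size: int) -> list[tuple[int, int]]:
    """Split the half-open range [start, end) into chunks of at most `size` blocks."""
    return [(lo, min(lo + size, end)) for lo in range(start, end, size)]


def backfill_chunk(chunk: tuple[int, int], max_attempts: int = 3) -> list[dict]:
    """Fetch one chunk, retrying the whole chunk if any call fails."""
    lo, hi = chunk
    for attempt in range(1, max_attempts + 1):
        try:
            return [fetch_block(h) for h in range(lo, hi)]
        except Exception:
            if attempt == max_attempts:
                raise
    return []  # only reached when max_attempts < 1


def backfill(start: int, end: int, chunk_size: int = 100, workers: int = 8) -> list[dict]:
    """Run chunks in parallel and return blocks in height order."""
    chunks = split_range(start, end, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(backfill_chunk, chunks)  # map preserves chunk order
    return [block for chunk in results for block in chunk]


blocks = backfill(0, 1_000)
```

Even this toy version has to handle partitioning, ordering, and retries, and it still lacks checkpointing, backpressure, and multi-stage transformations.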

Introducing Chainsformer

Chainsformer is the data adapter meticulously engineered for seamless integration with ChainStorage APIs. Its creation is driven by the intent to address the challenges we've previously highlighted. Chainsformer has the following benefits:

  1. Big data native approach: With Chainsformer, data from ChainStorage can be intuitively integrated with major big data ecosystems. It's perfectly compatible with platforms like Spark, Anyscale Ray, and various other compute engines.

  2. Operational Efficiency: Chainsformer ensures scalable, on-demand data backfill capabilities. This feature is invaluable for seamless product iterations and ensures uninterrupted data availability.

  3. Canonical Blockchain Primitives: Chainsformer sets up a unified upstream data pipeline. Its chief function is to perpetually replicate block data and blockchain primitives from ChainStorage into various data sinks. This design promotes data reusability, curtails data redundancy and boosts cost efficiency.

  4. Customizability: Chainsformer provides users with flexible options. These choices empower them to strike the right balance between data freshness and consistency, catering to their specific requirements.

Empowering a Range of Use Cases

Chainsformer is a versatile tool that caters to a diverse array of business needs at Coinbase. Its capabilities enable a wide range of use cases, from fresh-data applications that provide real-time insights to high-throughput operations that handle large volumes of data:

  • Marketplace: Chainsformer plays a pivotal role in the Coinbase NFT Marketplace by offering super-fresh low latency data for activities like minting and transfers. This advantage empowers customers to stay at the forefront of trends, enabling them to buy or sell NFTs ahead of others.

  • Compliance: Chainsformer facilitates the creation of link analysis in a graphical representation for compliance purposes. This allows for the tracking of funds across multiple hops, which is data-intensive but essential for regulatory compliance.

  • Security: In the realm of security, Chainsformer's high-throughput capabilities are harnessed to construct heuristics with speed and flexibility. This agility is paramount for identifying and addressing potential security threats and maintaining the system's integrity.

Chainsformer Architecture

Chainsformer Data Gateway and Chainsformer Compute Engine

Chainsformer's strength lies in its solid foundation, built upon two pivotal components: the Chainsformer Data Gateway and the Chainsformer Compute Engine. Let's explore the roles they play and how they synergize to ensure smooth and efficient operations.

To cater to a broad spectrum of business demands, the design options have been customized to align with the lambda architecture. This ensures that accessing Chainsformer is a seamless endeavor, providing two distinct pathways that can be adapted to suit a diverse array of business requirements:

  • Live Data Stream: The Chainsformer Data Gateway is purposefully engineered to facilitate unified columnar in-memory data formats, ensuring an efficient and responsive data access experience. It excels in providing real-time data streams for immediate insights.

  • Batch Processing: On the other hand, the Chainsformer Compute Engine is thoughtfully designed to support scalable, high-throughput, and fault-tolerant stream processing of live data streams. It is the go-to solution for handling data-intensive batch processing tasks with precision and efficiency.

Together, these components create a dynamic and versatile platform that caters to a spectrum of data processing requirements, making Chainsformer an invaluable asset for a wide range of businesses.

A Native Adaptor for ChainStorage Optimized for Big Data

The Chainsformer Data Gateway is an Apache Arrow Flight RPC service responsible for the parallel retrieval of data from ChainStorage. It simplifies the complexities involved in parallel data retrieval, providing a unified interface to the compute layer. Apache Arrow serves as the on-wire format for several reasons:

  • Interoperability: Apache Arrow is language-agnostic, enabling the utilization of the most suitable programming languages (e.g., Java and Python) and libraries (such as smart contract decompiler, Spark GraphX, TensorFlow) for constructing data pipelines.

  • Horizontal Scalability: Addressing the needs of large datasets, the format supports parallel data streams. By distributing data transfer across numerous nodes, Spark worker nodes can efficiently process incoming data across a distributed network.

  • High Efficiency: Apache Arrow stands out by maintaining a uniform memory representation that remains consistent across various programming languages and throughout the data transmission process. This results in seamless data transitions, eliminating the necessity for serialization or deserialization while minimizing memory copying.
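The efficiency argument can be illustrated with a small analogy that uses only the Python standard library. This is not Apache Arrow itself: the `array` and `memoryview` types stand in for a typed, contiguous column buffer, and the record fields are hypothetical. The sketch contrasts a JSON round trip, which encodes and decodes every value, with a column that consumers can read through a shared view without copying.

```python
import json
from array import array

# Hypothetical block records; the field names are illustrative only.
rows = [{"height": h, "gas_used": 21_000 * (h % 5 + 1)} for h in range(5)]

# Row-oriented JSON: each hop across an application boundary pays for
# an encode step and a decode step.
wire = json.dumps(rows)
decoded = json.loads(wire)

# Column-oriented layout: one contiguous, typed buffer per field.
heights = array("q", (r["height"] for r in rows))    # signed 64-bit integers
gas_used = array("q", (r["gas_used"] for r in rows))

# A memoryview exposes the same buffer without copying it.
view = memoryview(gas_used)
window = view[1:4]  # zero-copy slice over rows 1 to 3

# Mutating the column is visible through the slice, which shows that
# both names refer to the same memory.
gas_used[2] = 0
```

Arrow applies the same idea across processes and languages by standardizing the buffer layout, so a receiver can use the bytes it is handed as they are.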

In its present design, the Chainsformer Data Gateway operates as an RPC service. However, it can also be embedded or run as a sidecar component, bringing it closer to the Chainsformer Compute Engine for better data locality. Furthermore, ChainStorage is no longer a data transfer bottleneck: thanks to the file-based API that ChainStorage offers through its gRPC endpoints to provide high-throughput, reorg-resistant raw data streams of onchain data, data transfer speeds are dictated solely by the network between the Chainsformer Data Gateway and the chosen blob storage provider, such as S3.

The following basic endpoints are supported by the Chainsformer Data Gateway:

  • GetSchema: Provides the schema for a data stream, enabling the client to understand how to deserialize data from the subsequent DoGet API.

  • GetFlightInfo: Generates "N" flight tickets, each representing a specific task for processing blocks or events within defined ranges based on criteria like StartHeight, EndHeight, StartSequence, EndSequence, and partition rules.

  • DoGet: Transmits a data stream of ChainStorage blocks or events to a client based on the "tickets" obtained from GetFlightInfo, facilitating data transfer.

  • DoAction: Executes implementation-specific actions and provides results, functioning as a generalized function call for specific tasks.

  • ListFlights: Returns a list of available data streams or supported table names.
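The relationship between GetFlightInfo and DoGet can be sketched as a planning step followed by independent fetches. The code below is a simplified model, not the Chainsformer implementation: the `Ticket` class, `plan_tickets`, and `do_get` are hypothetical names, and treating the end height as exclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """Hypothetical ticket: a half-open block range [start_height, end_height)."""
    start_height: int
    end_height: int


def plan_tickets(start_height: int, end_height: int, blocks_per_partition: int) -> list[Ticket]:
    """Model of GetFlightInfo: split a block range into N independent tickets."""
    if blocks_per_partition <= 0:
        raise ValueError("blocks_per_partition must be positive")
    return [
        Ticket(lo, min(lo + blocks_per_partition, end_height))
        for lo in range(start_height, end_height, blocks_per_partition)
    ]


def do_get(ticket: Ticket) -> list[int]:
    """Model of DoGet: return the block heights covered by one ticket."""
    return list(range(ticket.start_height, ticket.end_height))


# Each ticket could be handed to a different worker; here they run in sequence.
tickets = plan_tickets(15_000_000, 15_002_500, 1_000)
heights = [h for t in tickets for h in do_get(t)]
```

Because every ticket describes a self-contained range, a compute engine can fetch tickets in parallel and retry a failed one without touching the others.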

Chainsformer Compute Engine: The Heart of Big Data Processing and Transformation

Chainsformer Compute Engine is the data processing center where raw data is transformed into the datasets that power applications. This component encapsulates the logic, execution, and management of data transformations.

In our current implementation, Spark takes center stage as the primary compute engine, responsible for converting the blocks supplied by the Chainsformer Data Gateway into live, actionable datasets. The choice of Spark as our computational powerhouse offers our users a wealth of benefits, simplifying the transformation of data tables from ChainStorage to their desired destination.

One of the standout advantages of using Spark is its mature and well-rounded compute ecosystem. With it, users can effortlessly perform the task of data transformation, bypassing the need to deal with complex intricacies such as partition management, retries, and various operational nuances.

Let's illustrate the power of Spark with a real-world example. In just a few lines of Python code, you can achieve what would otherwise be a complex and time-consuming task. This elegant simplicity not only saves time but also makes the data transformation process accessible to a broader audience.

# Assumes an active SparkSession named `spark`, as provided in a notebook environment.
blockchain = 'ethereum'
network = 'mainnet'
table = 'blocks'
table_name = f'chainsformer_{blockchain}_{network}_{table}'
schema_name = 'user_name'
# Read blocks from the Chainsformer Data Gateway as a streaming DataFrame.
df = (
    spark.readStream.format("cdap.org.apache.arrow.flight.spark")
    .option("host", f"data-chainsformer-{blockchain}-{network}-dev.cbhq.net")
    .option("port", 9090)
    .option("blocks_start_offset", 15000000)
    .option("blocks_per_partition", 1000)
    .option("blocks_max_batch_size", 100000)
    .load(table)
)
# Continuously write the stream to a table, checkpointing progress along the way.
(
    df.writeStream
    .option("checkpointLocation", f"FileStore/users/user_name/checkpoints/chainsformer/{table_name}")
    .toTable(f"{schema_name}.{table_name}")
)

This code snippet showcases how straightforward the data transformation process becomes with Spark. It empowers you to focus on the essential task of data manipulation, making the entire process more efficient and user-friendly. With Spark and Chainsformer, we are dedicated to simplifying the complexities of data management, ensuring that you can harness the full potential of your blockchain data effortlessly.

Chainsformer goes beyond being a mere transformation tool; it stands as a testament to our unwavering commitment to enhancing efficiency and promoting the spirit of reusability. Within Chainsformer, users gain access to standardized datasets that encapsulate the fundamental components of blockchain data. Presently, these datasets encompass critical elements such as blocks, transactions, and events, all securely stored within Delta Lake. These datasets serve as some of the foundational building blocks for Coinbase's data operations.

For those who prioritize the timeliness of their data, the Chainsformer Data Gateway emerges as a well-balanced solution. Alternatively, users seeking a more extensive API with even greater data freshness can turn to ChainStorage's raw API. Our approach revolves around offering a spectrum of choices tailored to the distinct requirements of our user community. This diversity ensures that each user can find the most suitable solution to meet their unique needs.

Performance

Throughput: Lightning-Fast Data Access

In the fast-paced world of blockchain and cryptocurrency, having access to data with exceptional throughput is a game-changer. When it comes to data sourced from Chainsformer, you can expect throughput of 1 to 2 GB per second or more, which lets indexer backfills complete within a few minutes.

  • Batch processing throughput: 1GB+ per second

  • Live data stream throughput: 1k+ blocks per second (varying by blockchain network)

In the context of data sourcing, throughput is the rate at which data is processed and delivered to the end-user. High throughput enables you to access and analyze vast amounts of information in record time, giving traders, investors, and developers the edge they need in the fast-moving blockchain market.
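As a rough worked example, backfill time follows directly from dataset size divided by throughput. The 500 GB dataset size below is a hypothetical figure chosen for illustration, not a measurement from Chainsformer; the 1 and 2 GB per second rates come from the figures above.

```python
def backfill_seconds(dataset_gb: float, throughput_gb_per_s: float) -> float:
    """Estimate backfill time as dataset size divided by sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput_gb_per_s must be positive")
    return dataset_gb / throughput_gb_per_s


# Hypothetical 500 GB dataset at the lower and upper throughput figures.
at_1_gb_per_s = backfill_seconds(500, 1.0)
at_2_gb_per_s = backfill_seconds(500, 2.0)

for label, seconds in (("1 GB/s", at_1_gb_per_s), ("2 GB/s", at_2_gb_per_s)):
    print(f"{label}: {seconds:.0f} s (about {seconds / 60:.1f} min)")
```

Under these assumptions the backfill takes roughly four to eight minutes; real durations also depend on transformation cost and sink write speed.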

Data Freshness: Near Real-Time Insights

Throughput is essential, but it's only part of the equation. Data Freshness, as delivered by Chainsformer, ensures that the data you receive is less than 5 seconds away from the tip of the chain. In the dynamic world of blockchain, this is of utmost importance. Real-time data, with a freshness guarantee, allows you to make decisions with confidence.

  • Data freshness: <5 seconds from the tip of the chain (varying by blockchain network)
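One way to reason about this figure is as the lag between when a block is produced and when it is available to a consumer. The sketch below assumes that definition, which the article does not spell out; the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Target taken from the "<5 seconds" figure above.
FRESHNESS_TARGET = timedelta(seconds=5)


def freshness_lag(block_time: datetime, received_at: datetime) -> timedelta:
    """Lag between block production and delivery (assumed definition of freshness)."""
    return received_at - block_time


def is_fresh(block_time: datetime, received_at: datetime) -> bool:
    """True when the lag is strictly below the freshness target."""
    return freshness_lag(block_time, received_at) < FRESHNESS_TARGET


# Hypothetical timestamps: a block delivered three seconds after it was produced.
block_time = datetime(2023, 12, 11, 12, 0, 0, tzinfo=timezone.utc)
received_at = block_time + timedelta(seconds=3)
lag = freshness_lag(block_time, received_at)
```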

This level of data freshness is a game-changer. It means you're always working with the most current data, whether you're a trader looking to seize market opportunities or a developer building the next big blockchain application. In an industry where every second counts, Chainsformer ensures you're never left behind.

In the ever-evolving world of blockchain and big data, Chainsformer stands as a pioneering adapter of ChainStorage. By embracing the tools and techniques of today's dynamic big data landscape, we have not only enhanced ChainStorage's scalability but have also elevated the overall user experience. Our mission is clear - to seamlessly bridge the intricate world of Web3 data with the vast realm of big data.

With the invaluable assistance of powerful compute platforms like Spark, Chainsformer marks a significant step towards establishing a more comprehensive blockchain index platform. This is not just a technical achievement; it's a testament to our commitment to pushing the boundaries of what's possible in the blockchain data ecosystem.

As we journey forward, we remain dedicated to learning, adapting, and evolving. Each iteration of Chainsformer is a step closer to perfection, and it's our unwavering aspiration to serve our community better with each advancement. We believe in the collaborative spirit of the blockchain and open-source community, which is why Chainsformer has been proudly open-sourced.

We look forward to your valuable feedback and contributions as we continue to shape the future of blockchain and big data integration. Together, we can build a stronger, more connected, and more informed blockchain ecosystem for all.

About David Lai, Leo Liang, Jie Zhang and Jian Wang

Staff Software Engineer

Director of Engineering

Senior Staff Software Engineer

Senior Engineering Manager