TL;DR: This is the final blog post in a three-part series exploring ChainStack, Coinbase's enterprise-grade blockchain data platform. In this series, we covered the architecture, technology, and processes that make ChainStack a powerful foundation for data availability, computation, and indexing. In this post, we take a deep dive into using ChainStack to build blockchain indexers for various applications.
Having previously unveiled ChainStack, Coinbase's crypto-native data platform, it's essential to delve deeper into its components. ChainStorage serves as the crypto data backbone, ensuring robust storage and seamless data availability. Meanwhile, Chainsformer is the adapter that brings web3 data into the big data world.
In this piece, we'll take you on a journey through the construction of a blockchain indexer from a data-centric perspective. Along the way, we'll also share invaluable insights gleaned from various use cases.
At its core, a blockchain indexer is an application dedicated to constructing the data layer for blockchain applications. Its primary role involves extracting data from various blockchain sources. Occasionally, this process integrates additional off-chain data. The ultimate goal is to transform this amalgamation of data into a format that's easily digestible and friendly for applications.
Navigating the intricacies of a blockchain indexer, we've distilled some key components from our exploration. Each of these components presents its own unique set of challenges:
Data Sources - The credibility and reliability of data sources are paramount. They provide the foundational truth and significantly influence the indexing process’s scalability and overall performance.
Compute Layer - This is the heart of the processing realm. It's where data is extracted from various sources, computations are performed, and the resultant data is stored. Whether through a straightforward program in Golang or Python or more specialized platforms like Spark, Apache Flink, or AnyScale Ray, the choice of implementation can vary based on specific needs.
Orchestration Engine - A crucial component that ensures the timely execution of the compute step. It also oversees operational nuances like batching, retries, and back-offs. While Temporal is a widely recognized orchestration engine, tools like Airflow are more aligned with Data Science-driven workloads.
Data Sink - This is the storage vault where the processed data is housed.
Data Serving - This service delivers the data to the application through APIs. In certain scenarios, applications might directly tap into the data sink, especially if they lean towards a data-native construct.
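To ground these roles, here is a minimal sketch in Python of how the pieces fit together. The interfaces and function names are illustrative assumptions, not a real framework.

```python
from typing import Any, Iterable, Protocol

class DataSource(Protocol):
    def get_blocks(self, start: int, end: int) -> Iterable[Any]: ...

class DataSink(Protocol):
    def write(self, records: Iterable[Any]) -> None: ...

def transform(block: Any) -> Any:
    # Decode the raw block into application-friendly records (transfers, balances, ...).
    return block

def index_range(source: DataSource, sink: DataSink, start: int, end: int) -> None:
    """One compute step: extract raw blocks, transform them, and persist the result."""
    blocks = source.get_blocks(start, end)            # data source
    records = (transform(block) for block in blocks)  # compute layer
    sink.write(records)                               # data sink

# An orchestration engine (e.g. Temporal or Airflow) schedules index_range over
# block ranges and handles batching, retries, and back-offs; a thin API layer
# (data serving) then reads from the data sink.
```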
We found that designing an efficient, production-grade blockchain indexer is a difficult task. Over the years, various teams have crafted tailored solutions to index blockchain data, each one optimized for specific use cases. While these custom solutions met immediate needs, they often hit scalability roadblocks.
We believe constructing a blockchain indexer shouldn't be an intricate puzzle. Ideally, 90% of the process should be standardized with a framework. It should be user-friendly, swift, dependable, and ready to roll out with top-notch quality within a very short period of time.
Enter ChainStack Indexing Platform.
The ChainStack Indexing Platform offers a streamlined crypto data processing journey: from data sources such as ChainStorage and canonical crypto datasets, through a compute layer built on Chainsformer and Spark, to an orchestration engine. It's flexible, too, allowing users to integrate their own storage solutions (Data Sink) and design their own APIs (Data Serving) as they see fit.
ChainNode - provides an ETH RPC interface on top of ChainStorage for EVM chains. It is not just another node in the blockchain space. Positioned as a Node as a Service (NaaS), ChainNode's evolution has been a response to the real-world problems that dapps and blockchain platforms frequently encounter when interacting with traditional nodes.
A common challenge that applications face when querying nodes is latency and potential downtime. Nodes, in their native state, aren't built with high availability as a core feature. Moreover, attempting to scale by querying a pool of load-balanced nodes presents its own set of issues. States across these nodes only achieve eventual consistency, which can lead to data discrepancies and inconsistencies between successive API calls – a significant problem for applications demanding real-time, accurate data.
To combat these challenges, ChainNode was envisioned as a highly available and horizontally scalable service, drawing inspiration from the Change Data Capture pattern.
Architecture:
Data Source: At the heart of ChainNode is the data sources component, which is built on ChainStorage. ChainStorage is known for its performant and high-throughput capabilities, particularly when it comes to handling raw block storage. This is a critical feature for a platform dealing with blockchain data, as it ensures that large volumes of data can be processed quickly and efficiently.
Compute Layer: The compute layer of ChainNode employs Golang scripts. Golang, known for its simplicity and efficiency, is an ideal choice for blockchain applications. It enhances the speed and reliability of the data processing within the ChainNode environment.
Orchestration: Orchestration within ChainNode is managed by Temporal workflows. These workflows are crucial for coordinating the various processes and tasks within the platform, ensuring that everything runs smoothly and efficiently. The orchestration engine plays a pivotal role in maintaining the high performance of ChainNode, especially when dealing with complex data structures and queries.
Data Sink: On the data management front, ChainNode utilizes DynamoDB as its data sink. DynamoDB, a fast and flexible NoSQL database service, is an excellent fit for ChainNode, offering seamless scalability and reliability. This choice further solidifies the platform's capabilities in managing large-scale blockchain data efficiently.
Data Serving: ChainNode uses a Golang RPC service that implements the Ethereum RPC interface. This service is a key component of the architecture, facilitating efficient data retrieval and interaction. It allows ChainNode to serve data in a compatible and efficient manner, particularly for applications built on or interacting with the Ethereum blockchain.
The core of ChainNode's efficiency lies in its rapid key-value store, which backs the onchain data. This design allows for highly efficient data access and management, a critical feature for any blockchain-based platform. By relying on this key-value store for many requests, ChainNode significantly enhances query response times, bypassing the need for direct on-chain queries which can be slower and more resource-intensive.
ChainNode ensures continuous data replication from ChainStorage. This process involves synchronizing changes from a compact node cluster, which are then re-indexed to cater to various query patterns. This ongoing replication is crucial for maintaining up-to-date data and ensuring that the information provided by ChainNode is both current and accurate.
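To make the replication model concrete, here is a minimal sketch in Python of a CDC-style sync loop. The ChainStorage client and key-value store interfaces shown here are hypothetical and exist only for illustration; the actual ChainNode replicator is a Golang service with its own APIs.

```python
import time

def replicate(chainstorage, kv_store, start_sequence=0, poll_interval=1.0):
    """Continuously replicate block events from ChainStorage into a key-value store.

    chainstorage.get_events() and kv_store.put()/delete() are assumed interfaces.
    """
    sequence = start_sequence
    while True:
        for event in chainstorage.get_events(after_sequence=sequence):
            if event.type == "BLOCK_ADDED":
                # Re-index the block under the query patterns ChainNode serves.
                kv_store.put(f"block/height/{event.block.height}", event.block.raw)
                kv_store.put(f"block/hash/{event.block.hash}", event.block.raw)
                for tx in event.block.transactions:
                    kv_store.put(f"tx/{tx.hash}", tx.raw)
            elif event.type == "BLOCK_REMOVED":  # chain reorg
                kv_store.delete(f"block/height/{event.block.height}")
            sequence = event.sequence
        # Persist the watermark so the loop can resume where it left off after restarts.
        kv_store.put("watermark", str(sequence))
        time.sleep(poll_interval)
```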
Outcome
ChainNode's architectural pivot from the supernode construct offers a unique value proposition in the blockchain space. By optimizing data storage and retrieval processes, it ensures that applications can interact with the blockchain in near real time, without the typical scale bottlenecks. Running a ChainNode cluster is also much cheaper than running a traditional cluster of nodes.
ChainNode addresses a common challenge in web3 applications: the need for specific query patterns that standard nodes do not provide. This issue is crucial as the demands of blockchain applications evolve, requiring more complex and varied data retrievals.
To illustrate this point, let's consider a few examples of such queries.
One frequent requirement is the ability to list all transactions related to a specific address. This includes not just native transfers but also interactions through smart contracts.
Users often need to view the balances of all assets for a given address at any given block. This requires an in-depth analysis of the blockchain to accurately reflect the state at any given point in time across multiple contracts.
There's a need for listing the balance history for an asset or an address, which involves tracking and compiling data over time to understand how balances have changed.
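To make these access patterns concrete, here is a minimal sketch in Python of what the query surface might look like. The method names, parameters, and types are illustrative assumptions, not the indexer's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol

@dataclass
class Transfer:
    block_height: int
    tx_hash: str
    token: str          # contract address, or "native" for the chain's base asset
    from_address: str
    to_address: str
    amount: int         # raw amount in the token's smallest unit

@dataclass
class Balance:
    block_height: int
    token: str
    amount: int

class IndexerQueries(Protocol):
    def list_transactions(self, address: str, limit: int = 100,
                          cursor: Optional[str] = None) -> List[Transfer]:
        """All native and smart-contract transfers touching an address."""
        ...

    def get_balances(self, address: str, block_height: int) -> List[Balance]:
        """Balances of every asset held by an address as of a given block."""
        ...

    def get_balance_history(self, address: str, token: str,
                            from_block: int, to_block: int) -> List[Balance]:
        """How one asset's balance for an address changed over a block range."""
        ...
```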
To meet these needs, we set out to create an indexer capable of quickly answering these specific queries for all blockchains in a consistent format. The goal was ambitious: to achieve a median latency of less than 10 milliseconds and a p99 latency of less than 100 milliseconds. Furthermore, we aimed to maintain these performance benchmarks at a significant scale, handling tens of thousands of requests per second. Achieving this level of performance and scalability was imperative to meet the growing demands of web3 applications.
When we break down this challenge, we find that an indexer is fundamentally about extracting, transforming, and aggregating onchain data. This process is designed to support fast and scalable queries, which are essential for the efficient operation of blockchain-based applications. Our approach to building this indexer was guided by these core functions, ensuring that the resulting tool was not only powerful but also versatile enough to handle a wide range of query requirements.
Addressing the following challenges is critical to the success and relevance of the canonical indexer:
Supporting multiple blockchains
Each blockchain has unique features and requirements. Even within the same EVM family, layer 2 solutions vary in token economics and fee calculations.
Handling diverse asset classes
Native tokens, ERC-20, ERC-721, ERC-1155, UTXO, ERC-4337, among others.
Each standard comes with its own set of corner cases and contracts with non-standard implementations.
Scaling indexing performance
Newer blockchains and Layer 2 solutions have faster block speeds and higher throughput compared to chains like Bitcoin and Ethereum.
Indexer jobs need to scale effectively to keep up with these advancements, particularly for rapid backfilling.
To address these challenges, we focused on building a robust framework that enhances our iteration speed. This framework is designed to be agile and adaptable, enabling us to quickly respond to changes and new requirements in the blockchain space. Furthermore, we leveraged automated validation to help scale our operations and identify issues efficiently. This approach ensures that our indexing platform remains cutting-edge, reliable, and capable of handling the diverse and dynamic demands of the crypto world.
Overall Architecture
The overall architecture of our platform is a sophisticated blend of technologies and strategies, designed to efficiently manage and serve blockchain data. Here's a breakdown of how different components come together:
Data Sources: Our system leverages a variety of data sources. These diverse sources ensure that we have comprehensive and up-to-date data for processing and analysis.
Chainsformer
S3 Parquet files
Previously computed states in DynamoDB
Data Sinks: The data is stored in different sinks, each tailored for specific needs.
DynamoDB is used for real-time queries.
S3 handles large transactions and payloads.
Delta Lake is utilized for data analytics.
Kafka events are employed for notification purposes.
This multi-sink strategy allows us to efficiently store and manage data according to its usage and requirements. There is no single solution for all scenarios, especially for hot partitions, such as when an address holds 17 million tokens from a single NFT collection. Our approach includes:
Using mixed cloud-native storage solutions.
Implementing compaction strategies to reduce write loads.
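To illustrate the compaction idea, here is a minimal PySpark sketch that collapses many per-block balance deltas into a single write per (address, token) before hitting the store. The column names and the in-memory example data are assumptions made for the example.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("delta-compaction-sketch").getOrCreate()

# One row per balance change, e.g. produced by parsing transfer logs.
deltas = spark.createDataFrame(
    [("0xabc", "0xtoken", 100, 5), ("0xabc", "0xtoken", 101, -2),
     ("0xdef", "0xtoken", 101, 7)],
    ["address", "token", "block_height", "delta"],
)

# Compact to a single row per (address, token): one write instead of N, which
# keeps hot partitions (e.g. an address holding a huge NFT collection) manageable.
compacted = (
    deltas.groupBy("address", "token")
          .agg(F.sum("delta").alias("net_delta"),
               F.max("block_height").alias("as_of_block"))
)

compacted.show()  # in the real job this result would be written to the data sink
```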
Data Serving: We use a Golang service running in Kubernetes for data serving. This setup provides robustness and scalability. Caching is implemented for watermarks and immutable data like contract metadata, enhancing response times and reducing load.
Spark Compute: Spark brings several beneficial properties to our platform:
It's fault-tolerant, ensuring data integrity and system reliability.
Experiments can be conducted in notebooks, offering quick results and iterations.
The system scales to hundreds of machines, accommodating many independent jobs.
We benefit from shared libraries and tools, promoting consistency and efficiency across processes.
A set of common components, such as UDFs and configurable sources and sinks, has been developed.
We have standardized job deployment and management tools. This includes things like exclusive deployment locks to allow only one job to run during updates, ensuring smooth and error-free deployments.
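As one way to picture the exclusive deployment lock, here is a hedged sketch in Python using a DynamoDB conditional write. The table name, item layout, and TTL handling are assumptions for illustration, not our actual deployment tooling.

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
LOCK_TABLE = "indexer-deployment-locks"  # hypothetical table name

def acquire_deployment_lock(job_name: str, owner: str, ttl_seconds: int = 900) -> bool:
    """Try to take an exclusive lock for a job; return False if someone else holds it."""
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName=LOCK_TABLE,
            Item={
                "job_name": {"S": job_name},
                "owner": {"S": owner},
                "expires_at": {"N": str(now + ttl_seconds)},
            },
            # Succeed only if no lock exists, or the existing one has expired.
            ConditionExpression="attribute_not_exists(job_name) OR expires_at < :now",
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise

def release_deployment_lock(job_name: str, owner: str) -> None:
    """Release the lock, but only if this owner still holds it."""
    dynamodb.delete_item(
        TableName=LOCK_TABLE,
        Key={"job_name": {"S": job_name}},
        ConditionExpression="#o = :owner",
        ExpressionAttributeNames={"#o": "owner"},
        ExpressionAttributeValues={":owner": {"S": owner}},
    )
```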
In essence, the architecture we have developed is a balanced mixture of advanced data processing tools, diversified data storage strategies, and innovative solutions for scalability and efficiency. This design enables us to handle the complexities and challenges of blockchain data management effectively.
Historical Balance Calculation Job
Various elements of our architecture are integrated within a Spark job — exemplifying some of the key features and functionalities we have discussed.
Deployment Lock in CI/CD Pipelines: The Spark job incorporates a deployment lock mechanism. This feature is crucial for integrating with CI/CD (Continuous Integration/Continuous Deployment) pipelines. It ensures that updates and deployments are conducted without conflicts or overlaps, maintaining the integrity and stability of the system during updates.
Spark Stream Processing: At the core of the job is Spark's stream processing capability. This allows for the real-time processing of data streams, a critical aspect for handling live blockchain data. The streaming process is designed to be efficient and robust, capable of handling large volumes of data with high throughput. In this case, we’re streaming data from Chainsformer’s transaction stream.
Automatic Checkpoints and Failure Resumption: One of the standout features of this Spark job is its ability to automatically create checkpoints and resume from failures. This means that in the event of a system failure or interruption, the job can pick up from the last checkpoint without losing data or progress. This feature is essential for ensuring data integrity and continuity of operations, especially in an environment where data is constantly being updated and processed.
Together, these features create a Spark job that is not only efficient and effective in processing and managing data but also robust and reliable, capable of handling the complexities and demands of blockchain data processing. The integration of deployment locks, stream processing, and automatic checkpoints demonstrates our commitment to building a system that is both high-performing and resilient.
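For readers unfamiliar with how checkpointed stream processing looks in practice, here is a minimal PySpark Structured Streaming sketch. The "chainsformer" format name, stream options, and checkpoint path are hypothetical stand-ins for how Chainsformer's transaction stream is actually wired up; the checkpoint-and-resume behavior itself is standard Spark.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("balance-indexer-sketch").getOrCreate()

# Read the transaction stream. "chainsformer" is a hypothetical source name
# standing in for however Chainsformer is registered as a streaming source.
transactions = (
    spark.readStream
         .format("chainsformer")
         .option("table", "ethereum.transactions")
         .load()
)

def process_batch(batch_df, batch_id):
    # Placeholder for the real transform: extract deltas, update balances,
    # and write the results to the data sinks (DynamoDB, Delta Lake, Kafka, ...).
    print(f"batch {batch_id}: {batch_df.count()} transactions")

query = (
    transactions.writeStream
        .foreachBatch(process_batch)
        # Spark persists stream offsets and state here; after a failure the job
        # resumes from the last committed checkpoint instead of reprocessing history.
        .option("checkpointLocation", "s3://example-bucket/checkpoints/balance-indexer")
        .start()
)

query.awaitTermination()
```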
Calculating historical balances for an address presents a significant challenge for a couple of key reasons. Firstly, the nature of balance calculation is inherently sequential. This sequential process is required to accurately track the changes in balance over time. Additionally, the structure of the blockchain itself can change due to occurrences known as reorganizations, or "reorgs". These reorgs can alter the blockchain's history, adding another layer of complexity to the task.
Together, these factors make the scaling of historical balance calculations particularly demanding. When you consider the vast scale of this task – encompassing tens of millions of blocks and hundreds of millions of addresses – the challenge becomes even more apparent.
To illustrate this, let's walk through a reorg scenario. This will help us understand the complexities involved in recalculating balances when the blockchain undergoes such structural changes.
At T2, the canonical chain is [1A, 2A]; at T4, the canonical chain is [1A, 2B, 3B]. To correctly and incrementally calculate the balance of an address at a block, we apply the following formula: balance(block) = balance(parent block) + delta(block).
In the above example, the balance at 2A is delta(2A) + balance(1A), and the balance at 3B is delta(3B) + balance(2B).
For the above scenario, the block stream will produce the following sequences.
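The exact table isn't reproduced here, but assuming CDC-style add/remove events, the stream for this reorg would look roughly like the following (an illustrative reconstruction, not the original figure):

```
sequence  event          block
1         BLOCK_ADDED    1A
2         BLOCK_ADDED    2A
3         BLOCK_REMOVED  2A   (reorg: 2A is no longer canonical)
4         BLOCK_ADDED    2B
5         BLOCK_ADDED    3B
```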
Notice that the block stream produces a monotonically increasing sequence number. If we store the balance at each block, we can leverage the sequence number to determine the correct balance at its parent block.
In the Spark job, we follow a structured process to manage and calculate balances efficiently. Here's an overview of the steps involved, with a code sketch after the list:
Extracting Deltas per Address and Block: We start by extracting the delta (change in balance) for each address for each block.
For blocks that are added, we add the amount to the receiving address and subtract it from the sending address.
Conversely, for blocks that are removed, we subtract from the receiving address and add back to the sending address. We also mark these delta entries with a flag to indicate they are from a removal event.
Data Organization: The data is then partitioned by the address and sorted by sequence to facilitate easier processing.
Fetching Previous Balances: We fetch the previous balance for each address from the database. This is identified as the balance from the balance table where the block height is less than the current height, sorted in descending order by sequence, with a limit of one. If there is no previous balance available, we default it to zero.
Accumulating New Balances: We use a window function to accumulate the new balances at various heights. This is done by aggregating the delta to the previously found balance in step 3.
Filtering Out Redundancies: Balances for removed blocks are filtered out as they are duplicated entries.
Updating the Database: The calculated balances are then inserted into the Mongo table.
Updating Watermarks: Finally, we update the watermark entry, which helps in tracking the progress of data processing.
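The following PySpark sketch mirrors these steps at a high level. The table and column names (deltas, balances, address, token, sequence, delta, is_removed) are assumptions for illustration, and the sink and watermark writes are omitted; the production job layers batching, reorg handling, and deployment controls on top of this.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("historical-balances-sketch").getOrCreate()

# Steps 1-2: per-address, per-block deltas, flagged when they come from removed
# (reorged-out) blocks and carrying the monotonically increasing sequence number.
deltas = spark.table("deltas")           # address, token, block_height, sequence, delta, is_removed
prev_balances = spark.table("balances")  # address, token, block_height, sequence, balance

# Step 3: latest known balance per (address, token); missing ones default to zero below.
latest = Window.partitionBy("address", "token").orderBy(F.col("sequence").desc())
prev = (
    prev_balances
    .withColumn("rn", F.row_number().over(latest))
    .filter("rn = 1")
    .select("address", "token", F.col("balance").alias("prev_balance"))
)

# Step 4: accumulate deltas in sequence order and add the previous balance.
running = (
    Window.partitionBy("address", "token")
          .orderBy("sequence")
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
balances = (
    deltas
    .join(prev, ["address", "token"], "left")
    .withColumn("prev_balance", F.coalesce("prev_balance", F.lit(0)))
    .withColumn("balance", F.col("prev_balance") + F.sum("delta").over(running))
    # Step 5: drop output rows produced by removed blocks (their deltas still
    # contribute to the running sum, which is what reverses the reorged amounts).
    .filter(~F.col("is_removed"))
    .select("address", "token", "block_height", "sequence", "balance")
)

# Steps 6-7: write the balances to the sink and advance the watermark (omitted here).
balances.show()
```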
In our canonical indexer, steps 3 and 4 are optimized through a stateful aggregation function. This function stores the current balance in the state store. By doing this, we avoid the need to repeatedly fetch balances from DynamoDB (or any KV storage), which significantly improves both cost efficiency and performance. This streamlined approach ensures more efficient and effective processing of blockchain data.
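In the streaming version, this can be expressed with PySpark's applyInPandasWithState (available since Spark 3.4), which keeps the running balance in the state store between micro-batches. The schema, grouping keys, and the delta_stream DataFrame below are assumptions for the sketch, not our exact implementation.

```python
import pandas as pd
from pyspark.sql.streaming.state import GroupStateTimeout

def accumulate_balances(key, pdf_iter, state):
    """Keep the running balance for one (address, token) group in Spark's state store."""
    address, token = key
    (balance,) = state.get if state.exists else (0.0,)

    rows = []
    for pdf in pdf_iter:
        for row in pdf.sort_values("sequence").itertuples():
            balance += row.delta
            if not row.is_removed:  # skip output rows from reorged-out blocks
                rows.append((address, token, row.block_height, balance))

    state.update((balance,))        # persisted for the next micro-batch
    yield pd.DataFrame(rows, columns=["address", "token", "block_height", "balance"])

balances = (
    delta_stream                    # streaming DataFrame of per-block deltas (assumed)
    .groupBy("address", "token")
    .applyInPandasWithState(
        accumulate_balances,
        outputStructType="address string, token string, block_height long, balance double",
        stateStructType="balance double",
        outputMode="append",
        timeoutConf=GroupStateTimeout.NoTimeout,
    )
)
```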
It's evident that indexing onchain data to cater to the diverse needs of various applications is fundamentally a data processing endeavor. This process must be adept at supporting a wide array of access patterns, a challenge we've thoroughly explored here. Our journey through the intricacies of this task has unveiled several obstacles inherent in dealing with blockchain data. Yet, by leveraging a diverse set of tools and innovative approaches, we've crafted solutions that not only meet these challenges but also set a precedent for efficiency and scalability.
Our hope is that the insights shared in this post will serve as a source of inspiration for others in the field and encourage the development of scalable indexing solutions tailored to specific needs. By sharing our experiences and strategies, we contribute to the collective advancement of the crypto domain. This collaborative effort is key to unlocking further utilization and innovation in this rapidly evolving space.
Together, we can advance the state of crypto, opening doors to new possibilities and applications.
About the authors: Henry Yang, Leo Liang, Jie Zhang, Jian Wang, and Ming Jiang