Coinbase migrated its clustering system from precomputing clusters to a solution that determines clusters dynamically, in real time, using a graph database. This architectural shift has improved scalability and reliability, and it has enabled richer visualization by allowing dynamic analysis of entity relationships within the graph database.
Clustering is a technique used to identify naturally occurring groups among data entities based on their attributes. The goal of clustering is to organize these entities into distinct groups, or clusters, where entities within the same cluster are highly similar across key attributes like demographics, interests, and behavior, while entities assigned to different clusters are expected to be dissimilar.
Clustering has been beneficial across various use cases including:
Risk management and fraud detection - Identifying suspicious transactions by analyzing deviations from a cluster's normal patterns
Personalized recommendations - Recommending products and services to users based on the preferences of others within their cluster
Targeted marketing - Developing focused marketing campaigns tailored to distinct consumer clusters
Below, we outline the journey to this next-generation architecture: the challenges we encountered, the strategies that overcame them, and the resulting clustering solution that is better equipped to support Coinbase's continued growth.
Coinbase has had a clustering system in place since 2015, handling ~150 million clusters, some containing over 50,000 nodes. This system stored the precomputed clusters in a NoSQL database. Whenever a user attribute was updated, one or more operations - create, merge, or delete - were required on the clusters. As Coinbase expanded its product offerings, this system could no longer meet the evolving demands of teams across growing products.
The logic for grouping users became increasingly complex and required a large number of database updates to support each use case. This resulted in performance degradation, increased storage costs, and difficulty supporting different read patterns.
Precomputing results can be advantageous for systems that are primarily read-oriented: the same computations would otherwise be performed repeatedly on read operations, so precomputing avoids incurring those costs each time. Our clustering system, however, is write-heavy, with frequent updates. In write-heavy workloads, precomputing data becomes increasingly suboptimal over time.
As a result, we needed to reevaluate our previous database choice to better scale the system. We evaluated various database options, starting with SQL databases, which excel at querying but lack flexibility for changing relationships, as well as different NoSQL solutions like key-value stores, document databases, and column-family databases - each with its own benefits but ultimately unable to optimally handle the intricate connections in our data. After a thorough evaluation, we determined that a graph database was best suited to model the interconnected user and attribute information as nodes linked by edges. Graph databases are specifically designed for handling complex relational data, offering a more intuitive way to represent and query user-attribute connections while providing robust traversal capabilities.
The clustering system follows an event-driven design pattern commonly used across Coinbase. It has three main components: the Data Ingestor, Storage, and the API Server. The Data Ingestor consumes events emitted by various domain services, Storage persists the data, and the API Server answers queries from client services.
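To make that flow concrete, here is a minimal Go sketch of the ingestion path; the event shape, the Storage interface, and the handler names are illustrative assumptions rather than Coinbase's actual definitions.

```go
package ingestor

import "context"

// AttributeEvent is a hypothetical shape for the events the Data Ingestor
// consumes from domain services when a user attribute changes.
type AttributeEvent struct {
	UserID    string // user whose attribute changed
	Attribute string // e.g. "bank_account", "phone_number"
	Value     string // attribute value, e.g. the bank account identifier
	Removed   bool   // true when the attribute was removed from the user
}

// Storage abstracts the graph writes performed on behalf of the ingestor.
type Storage interface {
	LinkAttribute(ctx context.Context, e AttributeEvent) error
	UnlinkAttribute(ctx context.Context, e AttributeEvent) error
}

// HandleEvent routes each consumed event to the appropriate storage operation;
// the API Server reads from the same graph, so no further fan-out is needed here.
func HandleEvent(ctx context.Context, s Storage, e AttributeEvent) error {
	if e.Removed {
		return s.UnlinkAttribute(ctx, e)
	}
	return s.LinkAttribute(ctx, e)
}
```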
At a high level, the architecture remains the same between our old and new clustering systems; the major change is how the data is stored.
Below is a snapshot of the key changes in the database layer between the old and new clustering system.
Clustering V1
One of the limitations of our Clustering V1 system was its use of a storage solution that lacked the flexibility to adapt as the relationships between users and their attributes evolved. For instance, if a user added a new bank account, the system had to search for an existing user with the same bank account and link the two. Similarly, if a user removed a bank account, the link between the two users had to be removed. Now, imagine performing these operations across hundreds of attributes for millions of users.
Clustering V1 used a NoSQL database as the primary datastore for computing user attributes and storing cluster information. These attributes were fed into a worker that queried another database to check whether any cluster updates were needed for the impacted users. If updates were necessary, the clusters were updated and the new cluster information was stored in the database. This process often led to race conditions within our workers, resulting in inconsistencies and scalability issues. Additionally, the system was not flexible enough to support clustering users based on different attributes. For example, the FooBarA service wanted to cluster users based on IP address, social security number (SSN), and phone number, while FooBarB wanted to cluster users based on SSN, phone number, and payment methods.
Clustering V2
For Clustering V2, we pivoted to using a graph database (Amazon Neptune) to model the relationships between users and their attributes. Users and their attributes are modeled as nodes, while the relationships between them are modeled as directed edges. If a user adds a bank account, a new edge with the relation "has-bank_account" is created between the user node and the bank account node. Similarly, if the user deletes the bank account, the edge between the user node and the bank account node is deleted. This model is extensible: it allows any number of attributes to be added and enables each domain to query the data according to its specific use cases.
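As a rough illustration of those edge writes, here is a sketch using the Apache TinkerPop gremlin-go driver against Neptune; the node labels, property keys, and function names are assumptions for illustration, not the production schema.

```go
package storage

import (
	gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
)

// LinkBankAccount upserts the bank account node and connects it to the user
// with a "has-bank_account" edge. Only the user node and the attribute node
// are touched; no other part of the graph is updated.
func LinkBankAccount(g *gremlingo.GraphTraversalSource, userID, accountID string) error {
	return <-g.V().Has("user", "userId", userID).As("u").
		// Reuse the bank account node if it already exists, otherwise create it.
		Coalesce(
			gremlingo.T__.V().Has("bank_account", "accountId", accountID),
			gremlingo.T__.AddV("bank_account").Property("accountId", accountID),
		).
		AddE("has-bank_account").From("u").
		Iterate()
}

// UnlinkBankAccount drops the edge between the user and the bank account node.
func UnlinkBankAccount(g *gremlingo.GraphTraversalSource, userID, accountID string) error {
	return <-g.V().Has("user", "userId", userID).
		OutE("has-bank_account").
		Where(gremlingo.T__.InV().Has("bank_account", "accountId", accountID)).
		Drop().
		Iterate()
}
```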
For example, in Clustering V1, if one use case wanted to find similar users based on SSN and phone number while another use case wanted to find similar users based on SSN, phone number, and address, we would need to precompute and store the clusters for both use cases in the database. However, with the help of a graph database in Clustering V2, such queries do not need to be precomputed and can be performed at runtime.
This approach allows stakeholders to modify the attributes used for defining similarity between users and fine-tune them as their use case evolves. This model can also scale to support hundreds of use cases with varying needs, since there is no precomputation involved.
Another major benefit of moving to a graph database was the elimination of continuous updates to cluster information in the database. Since clusters were precomputed in V1, finding a new shared attribute between two clusters required a cluster-merge operation, resulting in thousands of row updates. In Clustering V2, updates are localized to the user node and its associated attribute, leaving all other nodes untouched.
Clustering V2 provides a get-related-users API that clients can use to find related users based on a given source user. This API accepts a list of attributes from a predefined set, allowing clients to fetch users similar to the specified source user. By utilizing a graph database, we have eliminated the need to pre-compute clusters, offering clients a more dynamic and flexible interface for querying related users in real-time.
In Clustering V1, onboarding new query patterns required precomputing clusters, which resulted in long lead times for onboarding new clients or modifying existing client logic. In Clustering V2, however, all relationships between users and their attributes are stored in the graph database, so clients can find related users at runtime without any onboarding effort.
Below is the API proto exposed to the clients.
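A minimal sketch of what such a get-related-users surface could look like is shown here; the message, field, and enum names are illustrative assumptions rather than the actual Coinbase definitions.

```protobuf
syntax = "proto3";

package clustering.v2;

// Attributes a client can select from the predefined set.
enum Attribute {
  ATTRIBUTE_UNSPECIFIED = 0;
  ATTRIBUTE_SSN = 1;
  ATTRIBUTE_PHONE_NUMBER = 2;
  ATTRIBUTE_BANK_ACCOUNT = 3;
  ATTRIBUTE_IP_ADDRESS = 4;
  ATTRIBUTE_PAYMENT_METHOD = 5;
}

message GetRelatedUsersRequest {
  // User to start the traversal from.
  string source_user_id = 1;
  // Attributes that define similarity for this request.
  repeated Attribute attributes = 2;
}

message GetRelatedUsersResponse {
  message RelatedUser {
    string user_id = 1;
    // Attributes shared with the source user.
    repeated Attribute linking_attributes = 2;
  }
  repeated RelatedUser related_users = 1;
}

service ClusteringService {
  // Returns users related to the source user through the selected attributes.
  rpc GetRelatedUsers(GetRelatedUsersRequest) returns (GetRelatedUsersResponse);
}
```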
Below is how Clustering V2 queries Neptune in Go for each attribute provided as input by the client.
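A minimal sketch of such a per-attribute traversal with the Apache TinkerPop gremlin-go driver follows; the labels, property keys, and function name are assumptions for illustration, not the production implementation.

```go
package server

import (
	gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
)

// RelatedUsersByAttribute finds users connected to sourceUserID through a shared
// attribute node (e.g. edge label "has-bank_account"): traverse out along the
// attribute edge to the attribute node, then back in to every other user node.
func RelatedUsersByAttribute(g *gremlingo.GraphTraversalSource, sourceUserID, attributeEdge string) ([]*gremlingo.Result, error) {
	return g.V().
		Has("user", "userId", sourceUserID).          // locate the source user node
		Out(attributeEdge).                           // hop to the shared attribute node
		In(attributeEdge).                            // hop back to all users sharing it
		Has("userId", gremlingo.P.Neq(sourceUserID)). // exclude the source user itself
		Dedup().
		Values("userId").
		ToList()
}
```

In this sketch, the API server would run the traversal once per attribute in the request and merge the results; the traversal source is obtained by connecting to the Neptune endpoint with gremlingo.NewDriverRemoteConnection and wrapping it with gremlingo.Traversal_().WithRemote(...).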
One of the biggest pain points in Clustering V1 was the lack of visualization on how two users were related. Since the data was stored in a NoSQL database and the related users could be multiple levels deep, understanding these relationships was challenging. There was no UI available to visualize the connections between users or to identify which attributes linked them together. Additionally, external factors that might have linked two accounts were difficult to identify, making it hard for various teams to debug and determine the root cause of issues.
With the transition to a graph database in Clustering V2, we have introduced visualization tools that help stakeholders better understand why two users are linked. For example, below is a screenshot of the current system's visualization:
Blue nodes signify the source user.
Green, yellow, and red nodes signify the users related to the source user.
Attr1 and Attr2 represent the attributes connecting the users to the source user.
The above visualization enables stakeholders to find the common attributes linking related users to the source user and take appropriate action.
Support for Multiple Use-Cases: Enable finding related or similar users across varied scenarios without additional onboarding effort.
Enhanced Reliability: Eliminate race conditions and the need for multiple database updates, resulting in more robust performance.
Enable Visualization: Introduce tools to visualize user relationships, reducing analysis time by 60%.
Cost Efficiency: Achieve 30% savings in storage costs by eliminating redundant info and computing the clusters at runtime.