
Accelerating Deep Learning Adoption at Coinbase

TL;DR EasyTensor is a Python framework that simplifies building, training, and deploying deep learning models on tabular and multimodal data. It provides modular blocks for preprocessing and model design and leverages Ray for scalable distributed training to accelerate adoption and improve performance across teams.

By Ralph Edezhath, Roman Burakov

Engineering, December 3, 2024


Introduction

At Coinbase, we use machine learning for a diverse set of applications, from targeting notifications to detecting account takeovers. Most of these use cases involve multimodal data, combining text, numeric, categorical, and sequence features. Training deep learning models on multimodal data requires complex pre-processing and model architectures, so engineers often lean towards simpler solutions like gradient-boosted trees. We built a framework called EasyTensor that simplifies this by providing modular blocks for data preparation and model design. This framework accelerated the adoption and development of deep learning models, scaled dataset sizes by 100x, supported new model architectures, and ultimately drove significant performance improvements and business impact across multiple teams.

Overview

For tabular data, Gradient-Boosted Tree (GBT) packages enable training models with minimal pre-processing. Similarly, when datasets are purely natural language, NLP packages enable model training or fine-tuning on popular architectures with minimal effort. In contrast, training a deep learning model on tabular and multi-modal data requires:

  1. Pre-processing for different modalities – e.g. normalizing numerical features, encoding and truncating categorical/sequence features, tokenizing natural language.

  2. Different architecture components – e.g. embeddings for categorical features, LSTMs for numerical sequences, transformer layers for categorical sequences.

  3. Custom training logic – e.g. configuring optimizers, dataloaders, early stopping, logging.

Despite this added complexity, deep learning models offer potential advantages over off-the-shelf models like GBT: native support for sequence features, full control of feature representations and interactions, and more sophisticated model structures.
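
To make the first two points concrete, here is a minimal sketch of the kind of per-modality pre-processing this typically involves, using common open-source tools (the column names are illustrative, and this is not EasyTensor code):

```python
# A minimal sketch of typical per-modality pre-processing (illustrative columns).
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

df = pd.DataFrame({
    "account_age_days": [12, 340, 87],            # numerical feature
    "country": ["US", "BR", "US"],                # categorical feature
    "recent_events": [["login", "buy"],           # sequence feature
                      ["login"],
                      ["login", "sell", "buy"]],
})

# 1. Normalize numerical features.
num = StandardScaler().fit_transform(df[["account_age_days"]])

# 2. Encode categorical features (unseen values map to a reserved index).
cat = OrdinalEncoder(handle_unknown="use_encoded_value",
                     unknown_value=-1).fit_transform(df[["country"]])

# 3. Encode and pad/truncate event sequences to a fixed length.
vocab = {"<pad>": 0, "login": 1, "buy": 2, "sell": 3}
max_len = 4
seq = np.array([
    ([vocab.get(e, 0) for e in events] + [0] * max_len)[:max_len]
    for events in df["recent_events"]
])

# Text features would additionally need tokenization, and each of these arrays
# still needs a matching network component (embeddings, sequence encoders, ...)
# plus custom training logic before a model can be trained.
```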

EasyTensor aims to simplify this with a developer experience that’s similar to an off-the-shelf model, while offering the capabilities of deep learning models. The core abstraction which enables this is the EasyTensor block, which includes both pre-processing and the neural network layers. To build and train a model, a Machine Learning Engineer merely picks the blocks appropriate for the feature types involved in a given problem, and EasyTensor handles feature processing such as fitting encoders, as well as constructing and training the neural network. This reduces the barrier to entry for experiments with bigger and more capable deep learning models.


Example EasyTensor block for sequences of categories (e.g. user events).
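
As a toy illustration of this idea (not EasyTensor's actual code), a block for categorical sequences could bundle the encoder-fitting step together with the embedding and sequence layers it feeds:

```python
# Toy illustration (not EasyTensor itself) of a "block" that bundles
# pre-processing with its neural-network layers, here for categorical sequences.
import torch
import torch.nn as nn

class CategoricalSequenceBlock(nn.Module):
    """Fits an encoder on raw event sequences, then embeds and encodes them."""

    def __init__(self, embedding_dim: int = 32, max_len: int = 16):
        super().__init__()
        self.vocab = {"<pad>": 0}
        self.max_len = max_len
        self.embedding_dim = embedding_dim
        self.embedding = None
        self.encoder = None

    def fit(self, sequences):
        # Pre-processing step: build the vocabulary from the training data.
        for seq in sequences:
            for event in seq:
                self.vocab.setdefault(event, len(self.vocab))
        self.embedding = nn.Embedding(len(self.vocab), self.embedding_dim, padding_idx=0)
        self.encoder = nn.LSTM(self.embedding_dim, self.embedding_dim, batch_first=True)

    def transform(self, sequences):
        # Map events to ids, then truncate/pad each sequence to max_len.
        ids = [[self.vocab.get(e, 0) for e in seq][: self.max_len] for seq in sequences]
        ids = [row + [0] * (self.max_len - len(row)) for row in ids]
        return torch.tensor(ids, dtype=torch.long)

    def forward(self, ids):
        _, (hidden, _) = self.encoder(self.embedding(ids))
        return hidden[-1]                     # one embedding per sequence

block = CategoricalSequenceBlock()
block.fit([["login", "buy"], ["login", "sell", "buy"]])
out = block(block.transform([["buy", "sell"]]))   # shape: (1, embedding_dim)
```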

EasyTensor also includes various features and tooling to enhance developer experience:

  • Logging of training, validation metrics, model checkpoints and code to MLFlow

  • Fault-tolerant restarts from uploaded checkpoints

  • Feature importance computation

  • Automatic inference optimization with TensorRT

Usage Examples

1. Two-Tower Models


Example structure for a two-tower model built with EasyTensor.

Inference times can be very long for ranking and recommendation systems with a large number of user and item pairs. Two-tower (or factorized) models can greatly reduce inference time via an architecture that consists of two disjoint stacks of layers which interact only via a dot product at the very end. The last-layer outputs of these towers are user/item embeddings that can be computed and stored independently. This allows us to pre-compute one set of embeddings (e.g. for all users) in an offline job and only compute a small subset of embeddings in real time.
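
A minimal PyTorch sketch of this structure (illustrative, not EasyTensor's implementation):

```python
# Minimal PyTorch sketch of a two-tower model.
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    def __init__(self, user_dim: int, item_dim: int, embed_dim: int = 64):
        super().__init__()
        # Two disjoint stacks of layers, one per entity.
        self.user_tower = nn.Sequential(nn.Linear(user_dim, 128), nn.ReLU(),
                                        nn.Linear(128, embed_dim))
        self.item_tower = nn.Sequential(nn.Linear(item_dim, 128), nn.ReLU(),
                                        nn.Linear(128, embed_dim))

    def forward(self, user_features, item_features):
        u = self.user_tower(user_features)    # user embedding
        v = self.item_tower(item_features)    # item embedding
        # The towers interact only via a dot product at the very end,
        # so each embedding can be computed and cached independently.
        return (u * v).sum(dim=-1)
```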

Users rely on Coinbase notifications to track crypto market movements, but sending too many of them can fatigue users. To address this, we use Machine Learning to notify only the users who are most likely to be interested in a given notification. A two-tower model enables computing model scores for a very large number of users in near real time: a daily batch job computes user embeddings and stores them in memory; when a price change occurs, we generate the asset embedding and rank user/asset pairs by taking the dot product with the stored user embeddings.
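
At serving time, this reduces to a matrix-vector product against the cached user embeddings, roughly as follows (shapes, sizes, and names are illustrative):

```python
# Illustrative serving-time ranking against pre-computed user embeddings.
import numpy as np

# Offline: a daily batch job has stored one embedding per user (placeholder data).
user_embeddings = np.random.randn(1_000_000, 64).astype(np.float32)

def rank_users_for_notification(asset_embedding, top_k=10_000):
    """Score every user/asset pair with a dot product and return the top-k users."""
    scores = user_embeddings @ asset_embedding        # shape: (num_users,)
    top = np.argpartition(-scores, top_k)[:top_k]     # unordered top-k candidates
    return top[np.argsort(-scores[top])]              # sorted by score

# Online: when a price move happens, compute the asset embedding once
# (e.g. with the item tower) and rank all users against it.
ranked_user_ids = rank_users_for_notification(np.random.randn(64).astype(np.float32))
```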

EasyTensor is particularly useful for building these kinds of models. The Machine Learning Engineer merely specifies the blocks for each entity (user or item) and the package then builds the disjoint tower architecture and also includes abstractions for computing the individual item embeddings once the model is trained. This is not possible with traditional tree-based algorithms, and we have seen significant adoption of two-tower models since adding them to EasyTensor, resulting in more efficient targeting.

2. General-Purpose User Embeddings


Model architecture for self-supervised user embedding.

Our feature store now includes over a thousand different features built for various specific use cases, with associated maintenance and monitoring costs. An alternative approach is to build user representations from a sequence of events, in a self-supervised manner, similar to how language models are trained. In addition to simplifying the feature engineering process, such embeddings can also improve the performance of downstream models through transfer learning. EasyTensor simplifies the process of using entity embeddings in downstream tasks, since we can compute the input pre-processing and model output for just a single block which is then fed into the task-specific model.

There are many ways to build general user embeddings. One example uses sequences of in-app user events to represent user behavior. Given a pair of user event sequences – i) events from the latest session and ii) all prior historical events – the learning objective is to predict whether they belong to the same user or to different users. In positive samples, the two sequences come from the same user; in negative samples, the latest-session sequence comes from a randomly sampled user. In learning to classify these pairs, the model encodes each user’s distinctive interaction patterns and effectively performs deep user segmentation. The modular structure of EasyTensor simplifies the use of these embeddings in downstream models by exposing them through a dedicated block type.
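
A rough sketch of how such training pairs can be assembled (the actual pipeline details are internal and not covered in this post):

```python
# Rough sketch of constructing same-user / different-user training pairs for the
# self-supervised objective described above.
import random

def build_pairs(user_sessions):
    """user_sessions: user_id -> {"latest": [events...], "history": [events...]}"""
    pairs = []
    user_ids = list(user_sessions)
    for uid in user_ids:
        history = user_sessions[uid]["history"]
        # Positive sample: latest session and history belong to the same user.
        pairs.append((user_sessions[uid]["latest"], history, 1))
        # Negative sample: latest session from a randomly sampled other user.
        other = random.choice([u for u in user_ids if u != uid])
        pairs.append((user_sessions[other]["latest"], history, 0))
    return pairs

# Each sequence is encoded (e.g. with a transformer block) and a binary classifier
# is trained on the pairs; the trained encoder then yields general-purpose user
# embeddings for downstream models.
```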


Pre-trained embeddings are passed into downstream models via a dedicated block.

Distributed Pre-processing and Training


Data flow diagram showing what computations are done within Ray.

Ray enables both distributed data processing and model training at scale. For example, a categorical encoder requires computing feature frequencies, and with Ray clusters we are able to fit encoders on datasets 100x bigger than is possible on a single instance. But we have observed that Ray has a steep learning curve, and subtle differences in configuration options can greatly affect performance. For example, Dataset operations by default use all available resources in the cluster; to compute frequencies for multiple columns, it is therefore more efficient to execute a single map_batches transformation that counts values for all columns in a vectorized way, rather than running multiple Dataset transformations in parallel, each counting values for a separate column. With EasyTensor, these nuances are abstracted away and engineers are able to train models at scale without needing to learn the Ray API.
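
A sketch of that single-pass pattern with the Ray Data API (the path and column names are illustrative):

```python
# Counting category frequencies for several columns in one pass with Ray Data.
import pandas as pd
import ray

ds = ray.data.read_parquet("s3://bucket/training_data/")  # illustrative path

CATEGORICAL_COLUMNS = ["country", "device_type", "asset"]

def count_values(batch: pd.DataFrame) -> pd.DataFrame:
    # One vectorized pass over the batch produces partial counts for all columns.
    rows = []
    for col in CATEGORICAL_COLUMNS:
        counts = batch[col].value_counts()
        rows.append(pd.DataFrame({"column": col,
                                  "value": counts.index,
                                  "count": counts.values}))
    return pd.concat(rows, ignore_index=True)

partial = ds.map_batches(count_values, batch_format="pandas")
# Merge the per-batch partial counts into global frequencies.
frequencies = (partial.to_pandas()
                      .groupby(["column", "value"], as_index=False)["count"].sum())
```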

Beyond scaling, we also enhanced our workflows with features for reproducibility, observability, and fault tolerance. Every training run automatically logs all training and validation metrics to MLFlow, along with configurations, code, and model checkpoints. We compute and record feature importances using distributed inference at the end of each training session. In case of any interruptions during training, Ray allows us to resume from the last saved checkpoint, minimizing training time and resource expenditure.

Conclusion

Coinbase’s EasyTensor framework enabled the rapid training and deployment of deep learning models for various applications involving multi-modal data, driving improvements in customer experience. Using a common framework also allows engineers to use cutting-edge distributed processing engines like Ray to train models at scale, benefitting from best practices discovered by early adopters.
