AB testing is a crucial tool for evaluating the effectiveness of machine learning (ML) models and improving product development. At Coinbase, we have faced numerous challenges with traditional AB testing methods, including limited testing capacity and laborious experimentation processes. To overcome these hurdles, we developed AB universes, a structure for partitioning the user space to allow running multiple simultaneous conflicting tests.
AB universes have been used for some time in the tech industry, but we've built a uniquely elegant approach that combines universes with simple configurations for our ML engines. This provides an incredibly straightforward, yet powerful, “lego-block” approach that can be used for any highly experimented surface and often requires no new code to run an experiment.
In this piece, we explore the concept of AB universes, their benefits, and how we integrated them into our ML systems to ultimately accelerate the adoption and growth of ML-driven services at Coinbase.
Traditional AB Testing Limitations:
Traditional AB testing methods have some inherent limitations, particularly when it comes to experimentation throughput and efficiency. With traditional AB testing, it's often difficult to run multiple tests simultaneously without conflicts. This is especially impactful in the ML domain, where we often test multiple variants of the same model. When it is possible to run multiple tests at the same time, these tests still may require complex reasoning, documentation, and cross-functional alignment about how to prevent conflicts from affecting results.
Running multiple AB tests at once also often introduces a bad type of complexity into your codebase: invisible and scattered complexity. No matter how you centralize your references to experiments, you are forced to have behavior branches scattered throughout the codebase, which can have non-trivial dependencies on each other. AB testing setups can be arbitrarily complex, and in our experience, this complexity has bottlenecked development as we scale.
Finally, traditional AB testing methods can be labor-intensive, requiring engineers to set up the experiment branch and observability, perform post-analysis, and then tear down the experiment.
The AB Universe Concept:
An AB universe is a system where the user population is divided into n groups, referred to as "slots," with each slot assigned to a single AB test. By splitting the population into multiple slots, we can run many simultaneous tests.
While this technique reduces the statistical power of each test, since each test now runs over only 1/n of the total population, we can recover statistical power by running an experiment over multiple slots or by running it for longer.
[Figure: Illustration depicting AB universes and how they enable multiple tests to run simultaneously.]
While AB universes already remove a lot of complexity from the experimentation setup since we usually no longer run multiple experiments over the same user population, the true benefits come when you architect your service around this paradigm. We built our AB universes using a single configuration file to define all experiment behavior, abstracting all experimentation code from behavior code. This removed all experiment complexity from the rest of our code, and made running experiments as simple and easy as possible.
Implementation at Coinbase:
We have successfully implemented AB universes in two key ML systems at Coinbase. First, we built the ML Feed Engine using universes, which uses ML to power all recommendation systems at Coinbase — including the home feed. Second, we migrated the ML Notification Engine to use universes, which uses ML to optimize notification delivery.
These implementations have substantially increased the throughput of AB testing for both systems. We have plans to integrate this paradigm into non-ML services soon, as well as implement universe support natively in our internal experimentation platform.
[Figure: All experiments run on the ML Feed Engine since its inception.]
For any service where your experiments are self-contained, it is relatively easy to implement this framework and involves minimal cross-functional work. However, for systems where experiments cross service boundaries, you may need to implement support for universes in multiple services. In these cases, building an experimentation client with centralized and native universe support would dramatically simplify the adoption of this framework.
For example, when adding universe support to the ML Notification Engine, we also had to implement universes in a corresponding service that rate-limited notifications. This is because rate limits had to be proportional across experiment groups, or one experiment group could take up the entire bandwidth, reducing performance across other experiment groups and invalidating comparisons.
Here is some of the key terminology we define in our AB universe-driven services:
The “components” of your service are the different parts of the service that you may want to experiment on, each of which we can individually swap out. This is like decomposing a car into an engine, chassis, transmission, etc.
A "composer" is a combination of these components that make up a full treatment. This abstraction allows us to "name" different treatments, leading to uniform and easily analyzable metrics and logs.
A “composer manager” deterministically routes a request through recursive “matchers” to choose a composer for the request. Matchers may use arbitrary logic such as checking experiment groups or flipping a coin.
An experiment is a comparison between two different composers.
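To make the terminology concrete, here is a minimal Python sketch of the component and composer abstractions. The registry pattern, the class names (PopularityScorer, MLScorer), and the scoring logic are all hypothetical illustrations, not Coinbase's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical registry mapping component names (as they would appear in a
# configuration file) to their constructors.
COMPONENT_CONSTRUCTORS: dict[str, Callable[..., object]] = {}

def register_component(name: str):
    """Register a component constructor under a name usable in configuration."""
    def wrap(ctor):
        COMPONENT_CONSTRUCTORS[name] = ctor
        return ctor
    return wrap

@register_component("popularity_scorer")
class PopularityScorer:
    def score(self, item: str) -> float:
        return float(len(item))  # stand-in scoring logic

@register_component("ml_scorer")
class MLScorer:
    def score(self, item: str) -> float:
        return 0.5  # stand-in for a model call

@dataclass
class Composer:
    """A named combination of components that makes up one full treatment."""
    name: str
    components: dict[str, object]

def build_composer(name: str, component_names: dict[str, str]) -> Composer:
    # Look up each component's constructor by name and instantiate it.
    parts = {role: COMPONENT_CONSTRUCTORS[cname]()
             for role, cname in component_names.items()}
    return Composer(name=name, components=parts)

# An experiment is then a comparison between two named composers:
control = build_composer("control", {"scorer": "popularity_scorer"})
treatment = build_composer("ml_v1", {"scorer": "ml_scorer"})
```

Because composers are named, every metric and log line can be tagged with the treatment it belongs to, which is what makes the resulting analytics uniform.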
Two composers with two different scorer components
In our implementation of AB universes, we create a configuration file that declares all components, all composers, and the composer manager. We recursively parse the matchers in the composer manager to accommodate arbitrary setups, such as a random session-based holdout as well as a long-running user-based holdout.
Upon startup, our service constructs all named “composer” objects using the declared components, looking up constructors for all of the components from a hard-coded list. Finally, it creates a composer manager struct from the configuration, which can route requests to the correct composer. The composer manager automatically tags all metrics and logs with the component and composer name, and fires experiment exposures corresponding to experiments that have been checked.
The configuration contains all of the information necessary to take a user_id and deterministically find its corresponding composer. Then, each composer defines all of the components to be used. The components themselves have no knowledge of experimentation or composers. This keeps all of the complexity of experimentation confined within a single file, decomposing components into only their specific behavior.
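The recursive matcher routing described above can be sketched as follows. The matcher types and the 50/50 hash split are illustrative assumptions; the key property shown is that routing from user_id to composer is deterministic:

```python
import hashlib

class Matcher:
    """A node in the routing tree: returns a composer name, possibly by
    delegating to child matchers based on arbitrary logic."""
    def route(self, user_id: str) -> str:
        raise NotImplementedError

class Leaf(Matcher):
    """Terminal matcher: always resolves to one named composer."""
    def __init__(self, composer_name: str):
        self.composer_name = composer_name
    def route(self, user_id: str) -> str:
        return self.composer_name

class HashSplit(Matcher):
    """Deterministically splits users between two child matchers."""
    def __init__(self, salt: str, fraction: float,
                 if_below: Matcher, otherwise: Matcher):
        self.salt, self.fraction = salt, fraction
        self.if_below, self.otherwise = if_below, otherwise
    def route(self, user_id: str) -> str:
        bucket = int(hashlib.sha256(
            (self.salt + user_id).encode()).hexdigest(), 16) % 10_000
        child = self.if_below if bucket < self.fraction * 10_000 else self.otherwise
        return child.route(user_id)

class ComposerManager:
    def __init__(self, root: Matcher, composers: dict[str, object]):
        self.root, self.composers = root, composers
    def composer_for(self, user_id: str):
        name = self.root.route(user_id)
        # In the real system this is also where metrics and logs would be
        # tagged with the composer name and experiment exposures fired.
        return name, self.composers[name]

# Example routing tree: 50% of users see "ml_v1", the rest "control".
root = HashSplit("exp1", 0.5, Leaf("ml_v1"), Leaf("control"))
manager = ComposerManager(root, {"control": "ControlComposer",
                                 "ml_v1": "MLComposer"})
```

Because the tree is data, arbitrarily nested setups (a session-based holdout inside a user-based holdout, for example) are just deeper trees in the same configuration file.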
Here's a simplified example of a YAML configuration file for the AB universe system for a theoretical “Netflix Recommender System”:
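A configuration along these lines might look like the following. Every field name here is a hypothetical illustration of the pattern, not Coinbase's actual schema:

```yaml
# Hypothetical universe configuration for a "Netflix Recommender System".
universe:
  name: netflix_recs
  version: 2          # bump to rehash the entire universe
  num_slots: 10

components:
  candidate_source:
    trending: {}
    collaborative_filtering: {}
  scorer:
    popularity: {}
    two_tower_model:
      embedding_dim: 128   # component parameter, tunable without new code

composers:
  control:
    candidate_source: trending
    scorer: popularity
  ml_v1:
    candidate_source: collaborative_filtering
    scorer: two_tower_model

composer_manager:
  default: control        # slots not listed fall back to control
  slots:
    1:
      experiment: ml_v1_vs_control
      matcher:
        type: experiment_group
        control: control
        treatment: ml_v1
```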
This configuration can then be modified at any time, swapping out experiments in slots whenever desired. To write a new component, we simply write the code for that component, register its constructor with the composer manager, and then we can use it in our configuration in a new composer. We can then easily fill one of the slots with an experiment, testing that new composer against the status quo.
Our configurations can even take parameters for each component, allowing many experiments to be run without writing a single line of code. For example, in the car metaphor, we might design a wheel component with a parameter “wheel size.” An engineer could then run an experiment with multiple different wheel sizes, only having written one component.
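Continuing the car metaphor, a parameterized component might be sketched like this (the Wheel class and its parameter are, of course, hypothetical):

```python
# Hypothetical parameterized component: one "wheel" implementation supports
# many experiments just by varying a configuration parameter.
class Wheel:
    def __init__(self, wheel_size: int = 16):
        self.wheel_size = wheel_size

def build_from_config(params: dict) -> Wheel:
    # Parameters come straight from the configuration file; testing a
    # different wheel size requires no new code.
    return Wheel(**params)

small = build_from_config({"wheel_size": 15})
large = build_from_config({"wheel_size": 19})
```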
This approach streamlines experimentation into an explicit and simple “lego-block” process and vastly improves the scalability of experimentation.
Hashing in AB Universes:
Hashing in a universe is quite similar to hashing in a split test, using the following formula:
slot_num = hash(universe_name + universe_version + user_id) % num_slots
However, one important aspect to consider is that universe hashing remains consistent over time. This means that historical tests might have a slight influence on the results of future tests. For instance, if a test in slot 1 performs poorly and causes all new users to stop using the app, future tests in that slot could be negatively affected.
To address this issue, we use the "universe version" to allow for rehashing the entire universe when no experiments are running. We've also explored a more advanced technique called "hashing on demand." In this approach, creating a new slot pulls a user group randomly from the entire unused user space rather than from a pre-allocated space. When the slot is emptied, the user group is then returned to the unused user space.
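The slot formula above, including the version bump that rehashes the universe, can be sketched in a few lines. The choice of SHA-256 and the separator are assumptions; any stable hash with the same inputs gives the same behavior:

```python
import hashlib

def slot_for(universe_name: str, universe_version: int,
             user_id: str, num_slots: int) -> int:
    """Deterministic slot assignment following
    slot_num = hash(universe_name + universe_version + user_id) % num_slots."""
    key = f"{universe_name}:{universe_version}:{user_id}".encode()
    return int.from_bytes(hashlib.sha256(key).digest(), "big") % num_slots

# Same inputs always map to the same slot; bumping universe_version
# reshuffles every user's slot assignment at once.
slot_v1 = slot_for("ml_feed", 1, "user-42", 10)
slot_v2 = slot_for("ml_feed", 2, "user-42", 10)
```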
[Figure: Diagram of the example YAML configuration.]
Outcomes and Benefits:
By implementing AB universes in our ML-driven services, we've turned the main components of the service into easily AB-testable parts and thus allow easy iteration on any of these parts. This approach has led to the following significant benefits:
Increased throughput of AB tests: Running multiple simultaneous tests without conflicts accelerates the development and deployment of ML-driven services.
Greater flexibility and faster feedback: The ability to modify the AB setup at any time, run experiments over small user populations, and streamline development through automated observability and simple configuration shortens the implementation and feedback cycle.
Enhanced analytics: The "composer" abstraction provides uniform and easily analyzable metrics and logs, reducing boilerplate code and enabling better insights and decision-making based on test outcomes.
Centralized and cleaner code management: AB universes centralize branching and test configurations in a single configuration file, making it trivial to manage tests and enforce the cleanup of obsolete test configurations.
Empowering team members: The AB universe framework democratizes testing by enabling any team member to quickly prototype and run an AB test without extensive planning or stakeholder alignment, encouraging a culture of experimentation.
Encouraging experimentation: The reduced engineering time and lower stakes for a single test promote "boundary tests" (tests that are not necessarily expected to outperform the status quo), which provide valuable insights into the system and uncover potential areas for improvement.
[Figure: Experiments running during each month for the ML Feed Engine.]
AB universes have revolutionized AB testing for machine learning at Coinbase, enabling multiple simultaneous tests, offering greater flexibility, and reducing boilerplate work. This approach has significantly increased experiment throughput and accelerated the iteration of ML-driven services, and is now being rolled out to non-ML services as well.
The potential of AB universes to become a standard testing practice in the software engineering community is evident, as it empowers teams to innovate through more rapid learning, which will ultimately lead to better outcomes for customers.
Special thanks to Vik Scoggins and Rajarshi Gupta for their work on this piece and the CoinRecs team for their contributions to AB universes at Coinbase.
About the authors: Kyran Adams, Machine Learning Engineer, and Lucas Ou-Yang, Engineering Manager, Machine Learning.