
Scaling Detection and Response Operations at Coinbase Pt.1

Tl;dr: At Coinbase, we aim to be the most trusted and secure place to interact with the cryptoeconomy in support of our mission to increase economic freedom. To help achieve this aim we have a dedicated team - the Computer Security Incident Response Team (CSIRT) - who hunt, detect and respond to security threats targeting Coinbase systems and data. In this three-part blog series, we’ll cover some of the strategies and systems that the CSIRT has implemented at Coinbase to investigate and respond to threats more effectively, scale our detection and response operations, and ultimately to keep Coinbase the most trusted and secure cryptocurrency exchange.

By James Dorgan, Engineering, September 8, 2023


Detection and response teams invest a lot of time ensuring that they have suitable detection coverage for all of the applications and systems that need to be monitored within their environment. With modern security information and event management platforms (SIEMs) able to ingest, process and store ever-increasing volumes of data, analysts are responding to a higher volume of alerts from more data sources than ever before.

This increase often leads to analysts being overwhelmed and unable to efficiently triage and respond to the number of alerts that are generated on a daily basis. This inevitably leads to alert fatigue, increased mean time to detection & response, increased analyst turnover, and ultimately to a degraded detection and response service.  

The typical sentiment around alert fatigue is that the root cause is rules that need to be tuned or disabled. Although poorly performing rules are often the main contributor to alert fatigue, they’re not the only reason why security teams become overwhelmed with alerts as new data sources are onboarded and new detections are deployed. Even with a mature threat detection development life cycle that involves regularly reviewing, tuning and disabling low-fidelity rules, it’s possible that the number of alerts being generated still exceeds your security team's alert budget. Once you exclude low-fidelity detections, a key reason why detection and response teams struggle to scale is that their mean time to triage is too high. In other words, it takes too long per alert for an analyst to perform a meaningful investigation to determine whether the source of a detection is a false positive or a true positive.

So, how do we go about reducing mean time to triage? How do we help our detection and response teams to operate more efficiently, increase their alert budget, and ultimately allow them to scale more effectively? Over this three-part blog post we’ll talk through some of the strategies and systems that we’ve implemented within the CSIRT at Coinbase to help scale our detection and response operations.

To kick things off, let’s talk about why we decided to build our own centralized investigation and response platform!

The requirement for an Investigation and Response Platform

Screenshot 2023-09-08 at 3.10.53 PM

Let’s consider the example alert above and think through some of the questions that an analyst might ask while investigating this alert.

  • What recent logins does this user have?

  • What IP addresses and locations have been previously associated with this user?

  • What team does the owner work in, and where are they located?

  • What devices are assigned to the user?

  • When was the last time each of these machines was online?

  • What recent alerts have we had for the machines and the owner’s account?

These are only a small subset of the questions that an analyst will typically try to answer while investigating this type of alert, with the resulting information adding context to the investigation and helping the analyst understand the scope, impact and determination of the alert. However, even with this small set of questions, we can see how this simple alert could absorb a lot of analyst time to triage if relevant contextual information is not easily available. 

This contextual overhead for investigating alerts is made increasingly time-consuming and complex by the number of different logs and systems that need to be queried. For example, with the questions listed above we could assume that an analyst would need to query:

  1. SSO Logs

  2. 2FA Logs

  3. Human Resources Tools

  4. Asset Inventory Systems

  5. EDR Logs

  6. Historical Alerts and Incidents Databases

How long does it take to access and query each of these data sources? How often does an analyst forget to check a particular tool, leaving important investigative information uncovered? What is the impact on analyst burnout when they have to repeat this process for every alert they investigate?

It’s not uncommon to see analysts with an incomprehensible number of tabs open in their browser while they’re investigating an alert, with each tab displaying a different tool, data source, or SIEM query. Analysts have to continually pivot between these data sources in order to build an overall understanding of what they’re investigating. 

All of these issues increase mean time to triage, result in unneeded complexity, and ultimately limit a detection and response team’s ability to scale. 

Centralized Querying of Contextual Sources 

To tackle this issue at Coinbase, we built an investigation and response application that brings contextual information from common data sources into one central location. The goal was to have a single application that our analysts could use to query an investigative starting point (e.g. a serial number, an email address, or an IP address) and be able to quickly view and explore related contextual information from all of our data sources:

Screenshot 2023-09-08 at 3.12.22 PM

Upon querying for a given investigative starting point, this application uses a series of predefined queries to enrich the data point with relevant information. For example, when we search for my email address, the application will retrieve, format and present a wide range of information relating to my user entity at Coinbase:

Screenshot 2023-09-08 at 3.13.01 PM

In the screenshots above you can see that by providing just my email address, an analyst can immediately see who I am, whether my account is active, what team I work in, what devices I have, my login history, historical detections relating to my user entity, what applications I have installed, my related IP addresses etc. All of this information is pulled each time an analyst performs a search and allows them to see key relevant information in a matter of seconds.
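To make the idea of predefined enrichment queries more concrete, here is a minimal Python sketch. It is not the Coinbase implementation: the lookup functions, the `PREDEFINED` table and the `enrich` helper are all hypothetical names, and the lookups return placeholder data instead of calling real SSO, HR or asset inventory systems. The sketch only shows the general shape: classify the starting point, run every query registered for that kind of starting point, and collect the results in one place.

```python
import ipaddress
import re
from typing import Callable, Dict

# Hypothetical stand-ins for real data sources (SSO, HR, asset inventory).
# Each lookup takes the search term and returns a small dict of context.
def sso_recent_logins(email: str) -> dict:
    return {"recent_logins": []}  # placeholder result

def hr_profile(email: str) -> dict:
    return {"team": None, "location": None}  # placeholder result

def asset_devices(email: str) -> dict:
    return {"devices": []}  # placeholder result

def ip_related_users(ip: str) -> dict:
    return {"users": []}  # placeholder result

Lookup = Callable[[str], dict]

# Predefined queries, grouped by the kind of starting point they accept.
PREDEFINED: Dict[str, Dict[str, Lookup]] = {
    "email": {"sso": sso_recent_logins, "hr": hr_profile, "assets": asset_devices},
    "ip": {"related_users": ip_related_users},
}

def classify(term: str) -> str:
    """Guess what kind of starting point the analyst typed."""
    try:
        ipaddress.ip_address(term)
        return "ip"
    except ValueError:
        pass
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", term):
        return "email"
    return "unknown"

def enrich(term: str) -> dict:
    """Run every predefined query for this kind of term and gather the results."""
    kind = classify(term)
    context = {"term": term, "kind": kind, "sources": {}, "errors": {}}
    for name, lookup in PREDEFINED.get(kind, {}).items():
        try:
            context["sources"][name] = lookup(term)
        except Exception as exc:  # one failing source should not hide the rest
            context["errors"][name] = str(exc)
    return context
```

Calling `enrich("analyst@example.com")` would return one dictionary holding the SSO, HR and asset results side by side, which is roughly what each tab in an application like this might render. Recording per-source errors separately is a design choice in this sketch so that an outage in one data source does not block the rest of the investigation.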

If we think back to the example alert that we looked at earlier in this blog post, we can see how all of the questions that we identified can be answered by navigating through the series of tabs visible in the screenshots above. Where it may have previously taken 10 - 20 minutes to answer these investigative questions, it now only takes a few seconds. 

You can also see from the navigation pane on the left side of the screenshot that there are similar investigative flows available for other tasks that our detection and response team regularly performs, such as searching across the devices on our estate for specific files, hunting for unusual behavior on an endpoint, or reviewing cloud activity. Every time we identify a new workflow that we’re performing manually on a regular basis, we aim to centralize and automate these tasks within this single application to decrease the time to triage for subsequent investigations.

Beyond providing a central location to query log sources, building a core investigative and response platform has also allowed us to solve a number of other problems that were limiting our ability to scale: 

Shared Investigative Baselines 

If you rely on your analysts to retrieve information manually from each data source - even if that’s just writing different queries within a single SIEM - there’s often a considerable difference between how individual analysts investigate the same alert. This often leads to a mismatched investigative capability, with analysts potentially arriving at different conclusions while investigating the same alert. This could be down to something as simple as forgetting to check a specific log source, or even a particular field within a log entry.

By building a platform that automates the majority of common investigative tasks, we’re able to ensure a high-quality shared investigative baseline across our detection and response team. This means that our analysts are more efficiently able to transfer knowledge on how to most effectively query our data sources. For example, one analyst can build a query, integrate it into the application, and then every other analyst is able to benefit immediately from that new query from a centralized location.
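One way the "build a query once, share it with everyone" pattern could be sketched is with a small registry. The example below is an illustration only: `SHARED_QUERIES`, `shared_query` and `failed_2fa_attempts` are hypothetical names, and the query returns placeholder data instead of reading real 2FA logs.

```python
from typing import Callable, Dict, List

QueryFn = Callable[[str], List[dict]]

# Shared registry (hypothetical): query name -> query function.
SHARED_QUERIES: Dict[str, QueryFn] = {}

def shared_query(name: str) -> Callable[[QueryFn], QueryFn]:
    """Decorator that adds a query to the registry under the given name."""
    def decorator(fn: QueryFn) -> QueryFn:
        if name in SHARED_QUERIES:
            raise ValueError(f"query {name!r} is already registered")
        SHARED_QUERIES[name] = fn
        return fn
    return decorator

# One analyst contributes a query once...
@shared_query("failed_2fa_attempts")
def failed_2fa_attempts(email: str) -> List[dict]:
    # A real version would query the 2FA logs; placeholder data is returned here.
    return [{"user": email, "event": "2fa_failure", "count": 0}]

# ...and every investigation then runs the same registered set.
def run_baseline(email: str) -> Dict[str, List[dict]]:
    """Run every registered query so each analyst sees the same baseline."""
    return {name: fn(email) for name, fn in SHARED_QUERIES.items()}
```

Because `run_baseline` iterates over the registry rather than a hard-coded list, a newly registered query appears in every subsequent investigation without each analyst needing to remember it.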

Decreased Analyst Onboarding Overhead 

When a new analyst joins a detection and response team, they have to learn what data sources are available, the most effective way to query them, and how to pivot between them in order to investigate alerts efficiently. This overhead leads to new analysts often having a lower investigation standard, increased time to triage, or being completely unable to investigate alerts until they’ve had time to shadow other team members and slowly upskill themselves. 

While it’s still incredibly valuable to invest time digging into new datasets to understand what information is available, by centralizing common queries into a single application we’ve been able to significantly reduce both the overhead of enrolling new analysts into the team, and also the overhead of familiarizing the whole team with new data sources as they are onboarded.

Screenshot 2023-09-08 at 3.13.57 PM

Unified Response Workflows

So far we’ve discussed how we’ve helped to scale our investigative operations, but we haven’t touched on response capabilities. As with our investigations, we wanted to avoid our analysts having to pivot into multiple systems to carry out response tasks.

Let’s imagine you’ve identified a threat actor who has achieved remote access to an employee machine on your estate. To fully contain this threat you would likely need to:

  1. Network isolate the compromised employee’s device(s)

  2. Suspend the compromised employee’s account

  3. Revoke all active user sessions for the compromised user account

  4. Update the incident ticket and incident channel to reflect the actions taken

To complete each of these tasks, an analyst would need to log into multiple systems (SSO, EDR etc), copy information between them, have knowledge of how to drive each system, and then manually document each action taken in a ticketing system. This process takes time, requires familiarity and experience with each system, and depending on the scale of an incident could be very repetitive. 

To solve these issues, we also built response workflows into the same centralized application. An analyst can select a pre-built response task (e.g. Suspend and Clear Sessions for Users), select one or more targets, and execute the response task from within the same application where they investigated and identified the threat.
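A pre-built response task of this kind can be thought of as a named list of steps that is applied to each selected target. The sketch below is an assumption about how such a task could be modelled, not a description of the actual platform: `RESPONSE_TASKS`, `execute_task` and the stubbed step functions are hypothetical, and the stubs only print instead of calling an SSO provider.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class StepResult:
    target: str
    step: str
    ok: bool
    detail: str = ""

# Hypothetical step functions; real ones would call the SSO provider's API.
def suspend_account(user: str) -> None:
    print(f"[stub] suspend {user}")

def revoke_sessions(user: str) -> None:
    print(f"[stub] revoke sessions for {user}")

# A response task is a named, ordered list of steps.
RESPONSE_TASKS: Dict[str, List[Callable[[str], None]]] = {
    "Suspend and Clear Sessions for Users": [suspend_account, revoke_sessions],
}

def execute_task(task_name: str, targets: List[str]) -> List[StepResult]:
    """Apply every step of the task to every target, recording each outcome."""
    steps = RESPONSE_TASKS[task_name]
    results: List[StepResult] = []
    for target in targets:
        for step in steps:
            try:
                step(target)
                results.append(StepResult(target, step.__name__, True))
            except Exception as exc:
                results.append(StepResult(target, step.__name__, False, str(exc)))
                break  # skip remaining steps for this target, continue with others
    return results
```

For example, `execute_task("Suspend and Clear Sessions for Users", ["a@example.com", "b@example.com"])` would run both steps for both users and return four `StepResult` entries. Returning a result per target and step keeps partial failures visible during a large incident.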

Screenshot 2023-09-08 at 3.14.36 PM

As an added bonus, the investigation and response platform is linked to our incident management systems, so as a response task is executed by an analyst the relevant ticket and incident channel are also updated with the action taken, the author of the action, the targets, and the timestamp. This not only removes the burden of documentation from the analyst, but it also ensures that we have a complete audit trail of all response actions carried out by our analysts.
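The audit trail described here boils down to writing the same structured record (action, author, targets, timestamp) to every linked system. The following sketch assumes a simple `post` interface and uses an in-memory stand-in for the ticket and the incident channel. `AuditRecord`, `InMemorySink` and `record_action` are hypothetical names chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class AuditRecord:
    action: str
    author: str
    targets: List[str]
    # UTC timestamp captured when the record is created.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class InMemorySink:
    """Stand-in for a ticketing system or incident channel client."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.messages: List[str] = []

    def post(self, message: str) -> None:
        self.messages.append(message)

def record_action(action: str, author: str, targets: List[str],
                  sinks: List[InMemorySink]) -> AuditRecord:
    """Build one audit record and post the same message to every sink."""
    record = AuditRecord(action, author, list(targets))
    message = json.dumps(asdict(record), sort_keys=True)
    for sink in sinks:
        sink.post(message)
    return record
```

Building the record once and fanning it out means the ticket and the channel cannot drift apart, since both receive an identical message.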

Closing Comments

Deploying an in-house platform that ties together security tooling and log sources has significantly improved our ability to scale detection and response security operations at Coinbase. Repetitive investigation and response tasks that used to take 10 - 20 minutes per alert now take a matter of seconds, while the overall quality of our investigations has also improved.

Although there is an argument that a fully configured XDR system could have provided a similar pre-built solution, we found that for our use case there was no solution that offered everything that we needed. We either found that certain integrations were not supported, or there was a lack of flexibility and customization options, or at the very minimum we would be building a dependency on a single vendor that would inhibit our ability to develop our own capabilities.

Building a platform that sits in front of most of our log sources and security tools also has the benefit of abstracting complexity and vendor changes from security analysts working on day-to-day security operations. If an analyst wants to perform network isolation on an endpoint, they execute this response workflow directly from our central platform instead of logging into the EDR vendor platform. When we moved to a new EDR vendor we were able to simply update the existing response workflows to leverage the new EDR’s API, resulting in a change that was completely abstracted from the analyst. Although they were obviously aware that we had changed EDR vendor, there was no change or disruption to how they would isolate an endpoint.
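This kind of vendor abstraction is commonly achieved with a thin adapter interface. The sketch below shows the general pattern under that assumption. The `EDRClient` protocol, the two vendor classes and `IsolateEndpointWorkflow` are hypothetical, and the vendor classes return strings instead of calling any real EDR API.

```python
from typing import Protocol

class EDRClient(Protocol):
    """The only interface the workflow depends on."""
    def isolate(self, serial_number: str) -> str: ...

class OldVendorEDR:
    def isolate(self, serial_number: str) -> str:
        # A real adapter would call the old vendor's API here.
        return f"old-vendor isolated {serial_number}"

class NewVendorEDR:
    def isolate(self, serial_number: str) -> str:
        # A real adapter would call the new vendor's API here.
        return f"new-vendor isolated {serial_number}"

class IsolateEndpointWorkflow:
    """What the analyst triggers; it never references a specific vendor."""
    def __init__(self, edr: EDRClient) -> None:
        self._edr = edr

    def run(self, serial_number: str) -> str:
        return self._edr.isolate(serial_number)
```

Swapping vendors then means constructing the workflow with `NewVendorEDR()` instead of `OldVendorEDR()`, while the call an analyst makes, `workflow.run(serial_number)`, stays the same.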

In the following two parts of this blog series, we’ll be covering other ways that we’ve helped to scale detection and response operations. Part two will cover how we build context directly into our alerts and detection logic, and Part three will cover how we automate some of our detection and response workflows. 
