Scaling Detection and Response Operations at Coinbase pt3

Tl;dr: At Coinbase, we aim to be the most trusted and secure place to interact with the cryptoeconomy in support of our mission to increase economic freedom. To help achieve this aim we have a dedicated team - the Computer Security Incident Response Team (CSIRT) - who hunt, detect and respond to security threats targeting Coinbase systems and data. In this three-part blog series, we’ll cover some of the strategies and systems that the CSIRT has implemented at Coinbase to investigate and respond to threats more effectively, scale our detection and response operations, and ultimately to keep Coinbase the most trusted and secure cryptocurrency exchange.

By James Dorgan, Engineering, September 22, 2023, 4 min read

In part one and part two of this blog series we’ve discussed a number of the tools and systems that we’ve implemented within the security operations team at Coinbase to help scale our detection and response operations. The first two blog posts focused on methods of centralizing contextual information from various log sources and systems to reduce the mean time to triage for our analysts and to improve our ability to tune and develop effective detection rules. 

In this third and final part of the blog series, we’ll be exploring two topics. Firstly, we’ll be covering how we’ve built a process to bring individual employees and whole teams into the alert triaging process, which builds additional contextual information into our detections and reduces the number of alerts that our security team has to triage on a daily basis.

Secondly we’ll be discussing an approach we’ve implemented to reduce our mean time to respond for a subset of our high-severity alerts. Specifically this section of the blog post will focus on how we leverage automated response workflows to rapidly mitigate risk while our security analysts investigate the underlying behavior which triggered the alert. 

Gathering Investigative Context with SecurityBot

While reviewing our alert triaging process, we found that our analysts regularly lacked the context around why the underlying activity for the alert had taken place. To help explain this issue, let’s consider an example alert where an access key was created for an IAM user in a sensitive AWS account. This could either be extremely concerning from an incident response perspective, or just as easily it could be authorized activity relating to someone’s job role. In either case, the analyst responding to the alert will often lack the context as to why the access key was created so they will have to contact the end-user or responsible team to ask if they recognize the activity.

Ultimately users and teams are often the only entities that truly understand the context of why a set of events took place. While analysts can query log sources and systems to make educated decisions on the determination of an alert, sometimes it’s faster and more accurate to simply ask the relevant user or team to provide additional context. 

For example, imagine the following two alerts:

Alert #1

“An anomalous SSO login succeeded for the user e.user@coinbase.com originating from a new device: ‘John’s Macbook Pro’.”

Alert #2

“e.user@coinbase.com reported that they did not recognise a successful SSO login originating from a new device: ‘John’s Macbook Pro’.”

Although the wording of these two alerts is similar, the contextual difference is substantial. For the first alert an analyst would either need to reach out to the user to ask if they recognise the new device, or attempt to query and leverage various data sources to determine if the alert is a true positive. In comparison, with the second alert the analyst already knows that the end-user does not recognise the login activity, so it is more likely to be a true positive, and therefore the analyst can immediately begin responding to the potential threat.

As covered in previous blog posts, by bringing relevant users and teams into the alert triaging process you can increase the fidelity of alerts, automatically resolve false positives, and ensure that relevant context is automatically added to an alert before it is delivered to an analyst for investigation.

To achieve this at Coinbase we built a Slack application that is integrated with our detection and response stack. On a per-rule basis our analysts can codify an integration with the Slack application which allows them to specify a target user or team, a question to be asked, and how the application should behave depending on the response it receives:

Screenshot 2023-09-22 at 9.28.21 AM

The screenshot below shows what the example SecurityBot integration shown above would look like to the end-user if the detection rule triggers. We can see from this example how the user is brought into the triaging process for an alert relating to their account, presented with contextual information relating to the alert, and then asked to make a determination:

Screenshot 2023-09-22 at 9.28.57 AM

Upon receiving the Slack message from SecurityBot, the user can either:

  • Confirm they recognise the activity - the user will be asked to satisfy a 2FA challenge, which if successfully completed will cause the alert to be resolved.

  • Confirm they do not recognise the activity - the alert will then be escalated to the security team.
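The two outcomes above can be sketched in a few lines of Python. This is a minimal illustration rather than the actual SecurityBot code: the names `BotIntegration`, `UserResponse`, `AlertStatus`, and `handle_response` are hypothetical, and escalating when the 2FA challenge is not satisfied is an assumption, since the behavior in that case is not described here.

```python
from dataclasses import dataclass
from enum import Enum


class UserResponse(Enum):
    RECOGNIZED = "recognized"
    NOT_RECOGNIZED = "not_recognized"


class AlertStatus(Enum):
    RESOLVED = "resolved"
    ESCALATED = "escalated"


@dataclass
class BotIntegration:
    """Hypothetical per-rule settings: who to ask and what to ask them."""
    target: str    # user or team to contact
    question: str  # text shown in the Slack message


def handle_response(response: UserResponse, passed_2fa: bool) -> AlertStatus:
    """Map the user's answer to the alert's outcome."""
    if response is UserResponse.RECOGNIZED and passed_2fa:
        # The user vouched for the activity and satisfied the 2FA challenge.
        return AlertStatus.RESOLVED
    # The activity was not recognised, or the 2FA challenge was not
    # satisfied (escalating in that case is an assumption of this sketch).
    return AlertStatus.ESCALATED


integration = BotIntegration(
    target="e.user@coinbase.com",
    question="Did you just log in from a new device?",
)
print(integration.target, handle_response(UserResponse.RECOGNIZED, passed_2fa=True))
print(integration.target, handle_response(UserResponse.NOT_RECOGNIZED, passed_2fa=False))
```

Keeping the decision in one small pure function makes it straightforward to test each rule's behavior without sending real Slack messages.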

We can also leverage the same Slack application to gather information in addition to - or instead of - asking for a determination. For example, if we detect an abnormal change to a service and we want to validate if this change is authorized, we can programmatically reach out to a team or user and ask them to provide the JIRA ticket or Github PR where this work was tracked:

Screenshot 2023-09-22 at 9.41.00 AM

In addition to helping scale our detection and response team by automating the common investigative flow of asking for user context, bringing users into the alert triage process has another significant benefit: offloading the triaging process to end-users allows our security team to utilize threat detections that would previously have been too noisy for production.

For example, let’s imagine that we want to detect every time a user logs in from a new device. This behavior happens all the time at Coinbase with employees being onboarded, receiving new laptops, or utilizing test devices. The number of alerts that this detection would generate per day would quickly exhaust our security team’s capacity if we tried to manually triage every resulting alert. However, if we break down the total number of alerts per user, a typical user only triggers this detection a few times over a multi-year period. Therefore, by asking each user to triage this alert when it triggers, we cause minimal disruption to the end user while saving a significant amount of triaging time for our security team.

Other similar alerts that would historically have been too noisy for production but could now be leveraged include:

  • New 2FA devices being enrolled on a user’s account

  • Users completing a password reset workflow from new source IP addresses

  • Users logging in from ‘impossible travel’ locations

  • Attempts to modify local security policies on managed endpoints

  • Users attempting to set up broad email forwarding rules

  • Users attempting to install low-prevalence software on their corporate devices

From the example use cases listed above, it’s clear to see the benefits of bringing users and teams into the workflow of triaging relevant security alerts. However, it’s worth noting that there are several trade-offs that need to be balanced with this approach. Firstly, offloading triaging to end-users can potentially lead to incorrectly triaged alerts as we’re assuming that end-users understand and know how to analyze the information that’s presented to them. To mitigate this issue, we try to measure the risk of a missed true-positive detection against the overhead of triaging each alert manually. We also avoid using this style of automation for high-severity alerts that require an immediate response from the security team. 

The second trade-off is that if you offload a time-critical detection to an end-user, there’s a risk of a delayed response if the user is offline or unavailable. To counter this risk, we utilize a “timeout” feature that can be configured on a per-rule basis. If an end-user or team does not provide a response within the specified timeout period then the alert is automatically forwarded to a security analyst for triaging.
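As a rough sketch of how a per-rule timeout like this might be checked, consider the Python below. The `PendingQuestion` type, its fields, and `should_forward_to_analyst` are hypothetical names chosen for illustration, and the 30-minute value is an arbitrary example rather than a real configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PendingQuestion:
    """A question sent to a user or team (hypothetical structure)."""
    alert_id: str
    asked_at: datetime
    timeout: timedelta                     # configured per detection rule
    answered_at: Optional[datetime] = None


def should_forward_to_analyst(q: PendingQuestion, now: datetime) -> bool:
    """True when no answer has arrived within the rule's timeout window."""
    if q.answered_at is not None:
        return False
    return now - q.asked_at >= q.timeout


asked = datetime(2023, 9, 22, 9, 0)
question = PendingQuestion("alert-1", asked_at=asked, timeout=timedelta(minutes=30))

# Checked periodically (for example by a scheduled job) while awaiting a reply.
print(should_forward_to_analyst(question, now=asked + timedelta(minutes=10)))
print(should_forward_to_analyst(question, now=asked + timedelta(minutes=45)))
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check deterministic and easy to test.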

Automated Response Workflows

Identifying and tuning poorly performing detections is an important step in reducing the number of alerts that security teams have to triage. However, beyond just identifying poorly performing rules it’s also valuable to identify rules that are both high-fidelity and that target high-risk behavior. These are the alerts that are reliable and when they do trigger your team makes an effort to prioritize them because they identify high-impact behavior that poses an immediate threat. For example, such alerts could include:

  • Successful SSO logins from known attacker-controlled infrastructure

  • Endpoints connecting to C2 domains relating to an active incident

  • Detections of high-risk software used for credential dumping, exploitation, enumeration, etc.

These types of alerts are prime targets for response automation to speed up both mean time to triage and mean time to respond. An automatic response task could be as simple as retrieving a file from an endpoint, or retrieving the last 12 hours of browser activity and attaching it to the alert in your case management system. However, it could also be a more disruptive action such as suspending a user account, or isolating an asset until a full investigation can be performed.

To increase our ability to scale, and to drive down our mean time to respond and mean time to triage for our investigations, we try to identify and target these higher-fidelity rules for response automation. The type of automation that we deploy depends on what we’re trying to achieve with a given detection, whether that be decreasing the triaging time by building in additional context, or mitigating the risk posed by a potential threat by suspending an account until we can fully investigate an alert. The diagram below shows the types of alerts that we try to target for automatic remediation:

Screenshot 2023-09-22 at 9.42.03 AM
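One way to picture the selection step is a filter over per-rule statistics, keeping only rules that are both high-fidelity and high-risk. The Python below is a sketch under stated assumptions: "fidelity" is approximated as the share of past alerts that were true positives, the 0.9 threshold is arbitrary, and the rule names and counts are made-up illustrative values, not Coinbase data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RuleStats:
    """Historical triage outcomes for one detection rule (hypothetical)."""
    name: str
    true_positives: int
    false_positives: int
    high_risk: bool  # does the rule target high-impact behavior?

    @property
    def fidelity(self) -> float:
        # Assumption: fidelity = true positives / all triaged alerts.
        total = self.true_positives + self.false_positives
        return self.true_positives / total if total else 0.0


def automation_candidates(rules: List[RuleStats], min_fidelity: float = 0.9) -> List[str]:
    """Names of rules that are both high-fidelity and high-risk."""
    return [r.name for r in rules if r.high_risk and r.fidelity >= min_fidelity]


# Illustrative numbers only.
rules = [
    RuleStats("sso_login_from_attacker_infra", 19, 1, high_risk=True),
    RuleStats("login_from_new_device", 2, 98, high_risk=False),
    RuleStats("endpoint_connects_to_c2_domain", 10, 0, high_risk=True),
]
print(automation_candidates(rules))
```

A noisy rule such as the new-device login is filtered out here, which matches the earlier point that those alerts are better suited to user-driven triage than to disruptive automated response.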

Let’s consider an example alert. Imagine that it’s a holiday, weekend, or another time when your security team might have limited resourcing. You’re the on-call analyst and you receive a notification that the following high-priority alert has just triggered:

Screenshot 2023-09-22 at 9.42.36 AM

Assuming that the alert is accurate, we now know that an adversary likely has some level of access to our internal environment and could be attempting to perform any number of post-exploitation activities, including exfiltrating sensitive information. If the on-call analyst is away from their desk (maybe you have a 30 minute SLA on the weekends), it’s possible that a capable adversary could exfiltrate information from your organization before the analyst is able to isolate the compromised user’s account. However, if we have automatic remediation set up for this detection rule to isolate the compromised user’s account, then we can quickly contain the immediate risk posed by the compromised account until our analyst can perform a full investigation.

This style of automated response task is one that we’ve had particular success with at Coinbase. In our detection rules our analysts have the ability to codify automatic response tasks that will be carried out once the detection rule triggers:

Screenshot 2023-09-22 at 9.43.08 AM

After an automatic response task is completed, the status of each response task is attached to the alert in our case management system. This keeps our analysts informed of what remediation steps have already been carried out as they investigate the alert:

Screenshot 2023-09-22 at 9.43.36 AM

For the example response task shown above, when the detection rule triggers the affected user account will be signed out of all systems across Coinbase and the user will be forced to go through a password reset before they are able to regain access. The password reset process can be configured to either allow a self-service reset on the next login - as long as 2FA is satisfied using a hardware token - or after additional screening to validate the identity of the entity before access is restored:

Screenshot 2023-09-22 at 9.44.06 AM
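The overall flow in this section (response tasks codified on a rule, executed when the rule triggers, with each task's status attached to the alert) could be sketched roughly as follows. Everything named here is hypothetical: `Alert`, `run_response_tasks`, and the two placeholder tasks stand in for whatever calls a real identity provider or case management system would need.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    rule_name: str
    user: str
    # Status of each automated task, shown to the investigating analyst.
    task_status: Dict[str, str] = field(default_factory=dict)


ResponseTask = Callable[[Alert], None]


def sign_out_everywhere(alert: Alert) -> None:
    """Placeholder: a real task would revoke the user's active sessions."""


def force_password_reset(alert: Alert) -> None:
    """Placeholder: a real task would require a reset before next login."""


def run_response_tasks(alert: Alert, tasks: List[ResponseTask]) -> Alert:
    """Run each task and record whether it completed or failed."""
    for task in tasks:
        try:
            task(alert)
            alert.task_status[task.__name__] = "completed"
        except Exception as exc:
            # A failed task is recorded rather than raised, so the analyst
            # can see which remediation steps still need manual action.
            alert.task_status[task.__name__] = f"failed: {exc}"
    return alert


alert = run_response_tasks(
    Alert("sso_login_from_attacker_infra", "e.user@coinbase.com"),
    [sign_out_everywhere, force_password_reset],
)
print(alert.task_status)
```

Recording failures instead of stopping at the first error is a design choice of this sketch: it lets later tasks still run and leaves a complete picture on the alert.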

Closing Comments

While some of the tools and systems that we’ve discussed throughout this blog series may not be immediate drop-in solutions that can be deployed in every organization, the high-level ideas and approaches should be applicable to the majority of detection and response teams who are looking to scale and improve their operations. 

As you’ve been reading, you might have questioned why we decided to build so many of these solutions in-house as opposed to buying and implementing a pre-existing solution. While there are certainly use-cases where buying a product is the most effective way to enhance and scale the capabilities of a detection and response team, we found that pre-built solutions often lack some of the integrations, features, and flexibility that we require. Pre-built solutions are often an effective method to get a quick and substantial improvement to an identified problem space, but the benefits that they add are limited due to the product typically being generic in nature. A pre-existing solution may be able to offer an 80% solution, but once you’ve reached a level of maturity where you’re trying to close the remaining 20% gap, building an in-house solution often becomes the most effective route forward.

Over this three-part blog series we’ve discussed some of the strategies and systems that we’ve implemented at Coinbase to scale and improve the efficiency of our detection and response operations. These improvements feed into our commitment to be the most trusted and secure place to interact with the cryptoeconomy in support of our mission to increase economic freedom.