
The Subtleties of Error Handling Flaws in MPC

Tl;dr: Recently researchers at Fireblocks found an error-handling flaw in a number of MPC libraries that, under certain conditions, could manifest as a significant vulnerability. While our Coinbase Wallet consumer product was not impacted by this issue in any way, the libraries in question were used in a previous version of our Wallet as a Service (WaaS) solution. However, due to Coinbase’s approach of implementing multi-layered security across our software stack, there was no way to practically exploit this issue within any of our products. We acknowledge that other organizations and companies across the DeFi ecosystem were more significantly impacted, including some instances that may have enabled a malicious client to extract the secret key from an impacted service. Coinbase immediately released updated libraries with improved error handling to eliminate this issue. We thank the researchers at Fireblocks for identifying this issue, conducting a responsible disclosure, and helping to improve the security of the ecosystem.

By Yehuda Lindell and Jeff Lunglhofer, Chief Information Security Officer

Coinbase Blog, August 9, 2023

Overview

Appropriately handling error conditions is important for any software package, but particularly so for cryptographic software. Failure to do so can, under certain conditions, create significant security issues. One famous example is in RSA OAEP encryption (standardized in PKCS#1 v2.1), where incorrect error handling in implementations of RSA in PKCS#1 v2.0 opened the door to a powerful attack. In short, by disclosing certain details in error messages, it was possible to gradually discern sensitive key material, compromising the integrity of an encrypted session. This was corrected in PKCS#1 v2.1, but it serves as a powerful lesson as to how subtle flaws in cryptographic implementations can manifest.

MPC protocols, like all other software, contain integrity checks that can result in a failure if something goes wrong during a transaction. MPC protocols, as designed, guarantee security even if some of the participants are actively corrupted, and errors can be issued in the case that a party is detected behaving in an anomalous manner. In some cases, such errors require terminating the current transaction, and in other cases they require refusing to participate in future transactions until any potential security threat is eliminated.

As recently reported by the Fireblocks team, some implementations of the MPC protocol of Lindell17 failed to properly handle these error conditions (by not following the instructions specified in the paper). As a result, under certain conditions, this allowed the disclosure of data that could be used to derive the secret signing key. The severity of this issue depends on how persistent the connection was between the parties involved in the signing process: if a single participant can rapidly make repeated MPC signing requests, extracting the secret key is far easier than in a more robust implementation where additional steps are required before each new signature attempt.

Were Coinbase Product / Solutions Vulnerable?

Our Coinbase Wallet consumer product was not impacted by this issue. The MPC libraries containing the reported flaw were used in our WaaS solution but were not practically exploitable. At the time of the report, our WaaS solution was in use by a limited number of customers, and we immediately implemented software updates to remediate the very low level of risk posed by this issue.

For our WaaS implementation we viewed this as nearly impossible to exploit for two primary reasons:

  1. The attack can only be carried out by a malicious server inside Coinbase infrastructure, interacting with an honest client.

  2. Upon experiencing an error condition due to an attack attempt, our WaaS implementation requires a full manual re-authentication of each individual signing request by the client. This would have to be repeated hundreds of times during any exploit attempt.

These two constraints make any attempt to exploit this flaw within our WaaS product extremely challenging. To take advantage of the flaw, a third-party adversary would first need to obtain full control over our key management servers here at Coinbase. Even then, the adversary would have to trick one of our customers into initiating hundreds of fully authenticated signing requests; in each instance, the customer would need to set up the request in the client software and manually authenticate to approve that individual request. It is extremely unlikely that any customer would be willing to go through that tedious, manual process hundreds of times before contacting us for support.

Given the extreme circumstances required for any possible exploitation, we see little risk that an attack could be successful in the real world. Nevertheless, we take any flaws very seriously and we issued an immediate fix to the library.

Next Steps

Here at Coinbase, we are continuously evaluating our security posture - and that includes the security of the software we use and produce. We believe in the responsible disclosure of vulnerabilities and we actively engage the community to do just that. Security flaws and vulnerabilities will continue to be identified in new and legacy software for the foreseeable future. What differentiates software producers isn't the existence of issues, vulnerabilities, and bugs; it's their ability to rapidly understand, risk-assess, and remediate any exposures once they are identified.

We thank Fireblocks for their commitment to keeping the ecosystem safe and for both identifying and responsibly disclosing this flaw. Read on for a more detailed discussion of error handling in MPC implementations.

Technical Discussion on Error Handling

Error Handling in MPC

In regular software, error handling is mainly about usability: improper error handling can lead to non-graceful crashes that significantly impact usability. In cryptography, improper error handling can also lead to security vulnerabilities. One famous example is in RSA OAEP encryption (standardized in PKCS#1 v2.1), where incorrect error handling in implementations of RSA in PKCS#1 v2.0 opened the door to a powerful attack. The theoretical paper for RSA OAEP stated that ciphertexts that didn't pass certain tests must be rejected, but PKCS#1 v2.0 allowed implementations to "explain" in an error message which test failed (something that the paper didn't consider). Manger showed that this tiny and seemingly insignificant change opened the door to an efficient and powerful attack on RSA-OAEP in SSL/TLS type settings. This was corrected in PKCS#1 v2.1, but it serves as a powerful lesson as to how subtle cryptographic implementations can be.
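The lesson from Manger's attack can be illustrated with a minimal sketch: every internal failure reason must collapse into a single opaque error before it reaches the peer. The function and class names below are illustrative (this is a toy unpadding routine, not a real OAEP implementation):

```python
class DecryptionError(Exception):
    """Single opaque error: callers learn nothing about *which* check failed."""

def unpad_oaep_sketch(padded: bytes) -> bytes:
    # Internal checks, each of which could fail for a different reason.
    if len(padded) == 0 or padded[0] != 0x00:
        reason = "leading byte nonzero"   # treated as sensitive, never returned
    elif 0x01 not in padded[1:]:
        reason = "delimiter not found"
    else:
        return padded[padded.index(0x01, 1) + 1:]
    # Collapse every failure into the same exception, with no detail attached.
    raise DecryptionError()
```

Whatever logging is done of `reason` for debugging must itself be treated as sensitive, exactly as discussed below for MPC aborts.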

MPC protocols, like all other software and all other cryptographic software, contain checks that can result in protocol failure. In fact, the whole concept behind MPC is to guarantee security even if some of the participants are corrupted and attempting to cheat. As a result, many – if not most – MPC protocols have steps that are designed to force parties to behave honestly or be caught cheating. The question we address in this blog post is how such “I caught someone cheating” errors are supposed to be handled. Clearly, naive strategies which say – erase everything and refuse to play with this party ever again – are not viable, especially in a Blockchain setting. For example, consider the case of an end-user MPC wallet where the key is split between an MPC service provider and the user’s mobile phone. The aim of using MPC in this scenario is to provide security even if one of the parties is corrupted, and thus even if the user’s mobile device (for example) is infected by malware. If the service provider catches the user cheating and erases their key and service, then they could lose all of their money in the exact scenario where MPC is designed to protect them.

We will describe the different types of errors that typically appear in MPC protocols and how they should be handled. We will also describe defense in depth strategies that should be deployed in the context of MPC error handling. 

MPC With and Without an “Abort”

There are two main classes of MPC protocols – those that are always guaranteed to terminate successfully, and those that may abort due to malicious behavior but are guaranteed to be secure in all other aspects.* In general, it is not possible to prevent aborts in a setting where an honest majority is not assumed. In particular, this is the case for any protocol with two parties, since as soon as one party is corrupted, there is no honest majority. The dishonest majority setting is the standard security model used for threshold signing protocols in the field (this doesn't mean that a majority is mandated to be dishonest, but that security is maintained even if a majority are dishonest). This is mainly because mandating that the attacker has to corrupt all parties (or a full quorum of parties in the threshold setting) in order to break the security of a scheme is always preferable to the case where they only have to corrupt half of the parties. Since aborts are an inherent property of dishonest majority MPC protocols, and since these are the protocols used, it is important that such aborts are properly handled.

Error-Handling in MPC with Abort

There are three main types of errors that appear in the literature (ignoring regular software errors that are not a result of attempted cheating). 

  1. Error causing a simple abort: The most common type of abort is very simple – a party detects that another party has cheated and simply halts this execution. The two main principles for dealing with this type of abort are:

    • All randomness used in this execution is to be erased, and any new protocol execution must begin from scratch with fresh randomness.

    • Unless otherwise specified, if the party aborting may have done so due to multiple different checks, then it should not send an error message with the reason they are aborting (this is the lesson from Manger’s attack on RSA-OAEP described above). Furthermore, any log of the specific error (needed for debugging) should be treated as sensitive. In many cases, there is actually no problem sending what the error was. But this needs to be explicitly specified by protocol designers in order to avoid an error that leaks information in a protocol where the reason for the error may not be revealed.

Note that most MPC aborts are of this kind, and unless the paper/specification states otherwise, this is the way the abort should be treated. Examples of this type of abort in threshold signing protocols appear in [HLNR18, DKLs19, CNP20], for just a few examples. We remark that an error causing a simple abort should not just be ignored. Indeed, it means that one or more of the parties have behaved maliciously or have been corrupted by an attacker, and this should be investigated in order to remove any infected software. Nevertheless, there is no need for any special error handling at the MPC protocol layer for such aborts. 
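The two principles for simple aborts can be sketched in a few lines: erase the execution's randomness on any detected cheat, raise an opaque error, and force a retry to construct a brand-new session with fresh randomness. The class and method names here are illustrative, not taken from any particular library:

```python
import secrets

class ProtocolAbort(Exception):
    """Opaque abort: no reason code is exposed to the peer."""

class SigningSession:
    """Sketch of simple-abort handling for one protocol execution."""
    def __init__(self):
        self.nonce = secrets.token_bytes(32)   # fresh per-execution randomness
        self.aborted = False

    def verify_peer_message(self, check_passed: bool):
        if not check_passed:
            self.nonce = None        # erase all randomness from this execution
            self.aborted = True
            raise ProtocolAbort()    # no detail about which check failed
```

A retry must instantiate a new `SigningSession`; resuming an aborted session (or reusing its randomness) is exactly what the principles above forbid.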

  2. Error causing a total abort: A less common type of abort, but one that exists in the literature, has the property that if Alice catches Bob cheating, then she must both abort this (or any concurrent) session and refuse to play with him again in the future. This type of abort typically happens when the mere fact that the protocol aborted can reveal some information to the adversary. Since whether or not the abort happened can reveal at most one bit, a single abort does not meaningfully impact security. However, if a single bit is revealed enough times (e.g., 150-200), this can suffice for extracting the secret key. As a result, if cheating is detected in such protocols, the honest parties must refuse to run future executions as well. Below, we will discuss how such errors should be dealt with in the context of Blockchain wallets.

Examples of this type of abort appear in the two-party ECDSA protocol of [Lindell17], as well as in protocols for dual execution (e.g., [MF06, HKE12]). Note that dual execution protocols in general leak up to a bit if a party cheats. In some settings, this cannot be allowed. However, when using such protocols where the only private input is a cryptographic key, leaking a single bit causes no harm (it can be guessed anyway). Thus, dual executions may be used in this setting (e.g., for key derivation). However, one must limit the number of executions where cheating took place, or the key may be leaked.
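To see why repeated one-bit leaks are fatal, consider an illustrative "abort oracle" (this is a toy model, not the actual attack on [Lindell17]): if the adversary can arrange for the abort/no-abort outcome of each execution to encode one chosen bit of the secret, then one execution per bit recovers the whole key.

```python
def abort_oracle(secret: int, bit_index: int) -> bool:
    # "Did the protocol abort?" - here, the answer encodes one secret bit.
    return bool((secret >> bit_index) & 1)

def extract(secret_bits: int, oracle) -> int:
    """Recover a secret of `secret_bits` bits, one execution per bit."""
    recovered = 0
    for i in range(secret_bits):
        if oracle(i):                # one (failed) protocol execution
            recovered |= 1 << i
    return recovered
```

This is exactly why a total-abort protocol must lock after the first detected cheat: with one bit per run, a few hundred runs suffice for a full signing key.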

  3. Detected error that cannot be revealed: A third type of error, which is rare but also exists in the literature, is one where, in the case that Alice detects Bob cheating, she must not reveal that fact now or even in the future. These protocols have the property that Alice can recover and continue the execution even if Bob cheated, but revealing to Bob that she caught him cheating can leak a bit of information. As a result, if leaking a bit is unacceptable, then Alice must continue this execution, and even run future executions, even though she knows that Bob is corrupted, and no log or record of the cheating attempt should be revealed. Of course, if leaking a bit is not a problem (as for cryptographic keys, as described above), then one may prefer to notify the user that they have been corrupted at the price of leaking a bit. Nevertheless, in this case, if one does not notify the user, then nothing is leaked. Protocols with this type of error appear in [Lindell13, Brandão13]; to the best of my knowledge, they have not been proposed for practical use.

Identifiable aborts: There is another category of aborts that are called identifiable aborts. These are the same as above, except that each party who detects cheating knows unequivocally which party cheated. This can be useful for determining how to act after a cheat was detected, but is otherwise the same as above. We note that in the two-party setting, all aborts are identifiable, since each party knows that if they are honest then it must be the other party who has cheated. Having said the above, unless a protocol with identifiable aborts provides a proof of cheating that can be verified by others, this may not be very helpful. For example, consider again the case of an MPC wallet with a key shared between the service provider and the user’s mobile device, and assume that the user’s mobile device has been infected with malware and is running an active attack on the protocol. The server will issue an error message that it was attacked. However, after the fact, it isn’t actually clear if the mobile device was compromised and attacked the server, or if the server was compromised and its attack was to lie and say that the user’s mobile device was cheating. Without a mathematical proof of cheating, it isn’t necessarily possible to differentiate between these two settings. In practice, if cheating is ever detected, then forensics can be used to verify what happened (unless the protocol generated a verifiable proof of who cheated, in which case the cheating party can be immediately isolated).

Cheating vs. Crashing? What is the Difference?

In some cryptographic protocols, the mere fact that a corrupted party halts midway can constitute an attack. I will describe two examples. In the first example, the protocol may use something called “cut-and-choose” where one party sends a bunch of encrypted values (of some specified format) and the other party verifies that they were constructed correctly by asking the first party to open a random subset of them to be checked. If the first party created some incorrect values and these were chosen to be opened, it can then halt without opening them and claim that it halted due to a network or local error. It can therefore avoid detection and can run again, bypassing the security check. In the second example, consider two parties Alice and Bob who wish to remotely toss a coin in order to determine who gets to choose which movie they are going to watch. A protocol for this can involve Alice sending an encrypted bit to Bob (or technically, a commitment), and then Bob sending a bit back to Alice, who finally decrypts showing Bob the bit that she originally sent. They agree that if Alice and Bob sent the same bit then Alice gets to choose the movie, and otherwise Bob gets to choose the movie. This is secure (with each getting to choose with probability ½) since Alice sends her encryption before Bob chooses his bit and cannot change it afterwards (this assumes something called committing encryption), and Bob chooses his bit after seeing Alice’s encryption but before it was opened and so his bit is also independent of Alice’s bit. Since independent random bits equal each other with probability ½, we have the desired property. However, since Alice sees Bob’s bit before she opens her encryption, nothing stops her pretending to crash if the result isn’t to her liking, and then asking Bob to run again from scratch. 
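The coin-toss example just described can be sketched with a hash-based commitment standing in for "committing encryption" (the helper names are illustrative):

```python
import hashlib
import secrets

def commit(bit: int) -> tuple[bytes, bytes]:
    """Alice commits to her bit; returns (commitment, opening)."""
    opening = secrets.token_bytes(32) + bytes([bit])   # randomness hides the bit
    return hashlib.sha256(opening).digest(), opening

def verify_open(commitment: bytes, opening: bytes) -> int:
    """Bob checks the opening matches the earlier commitment."""
    if hashlib.sha256(opening).digest() != commitment:
        raise ValueError("commitment does not open correctly")
    return opening[-1]

def coin_toss(alice_bit: int, bob_bit: int) -> str:
    commitment, opening = commit(alice_bit)   # 1. Alice sends commitment
    # 2. Bob sends bob_bit, having seen only the commitment
    a = verify_open(commitment, opening)      # 3. Alice opens; Bob verifies
    return "Alice" if a == bob_bit else "Bob"
```

The attack in the text lives between steps 2 and 3: after seeing Bob's bit, nothing in the protocol itself stops Alice from "crashing" instead of opening, and Bob cannot distinguish that from a genuine fault.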

In both of the above examples, pretending to crash can provide an advantage to a malicious attacker. The problem is that it’s not possible to distinguish between a real crash and a cheating strategy. As a result, when this may matter, the application using the MPC protocol must take it into account. We stress, however, that in most MPC protocols, this is not an issue. For example, any protocol with simple aborts (as defined above) is not affected by this issue. In contrast, some protocols with total abort may be affected (e.g., dual execution protocols like [HKE12]), while others are not (e.g., the two-party ECDSA protocol of [Lindell17] is not affected, and aborting does not have any security impact).

Distinguishing Cheating and External Tampering

Since cheating detection can have ramifications (clearly this is the case with total aborts, but even simple aborts need handling at the application layer, and potentially an investigation into whether a user has been compromised), it is important that external tampering with messages sent from one party to another does not result in an MPC error. This can be achieved simply by using authenticated encryption between all parties (our recommendation is to use standard up-to-date TLS, with a fixed ciphersuite) and by keeping TLS errors separate from MPC errors. This means that an error decrypting a TLS message will never reach the MPC layer, and so will be interpreted in the same way as a message that just never arrived.
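The layering just described can be sketched as follows. The exception names and the `decrypt` callback are illustrative; the point is only that a transport-layer authentication failure is surfaced as "message not received" and never as an MPC cheat:

```python
class TransportError(Exception):
    """Channel failure: treated like a dropped message (retry/timeout)."""

class MPCCheatDetected(Exception):
    """MPC-layer error: raised only on authenticated, well-formed input."""

def receive_mpc_message(decrypt, raw: bytes) -> bytes:
    try:
        plaintext = decrypt(raw)         # e.g. TLS / AEAD record layer
    except Exception:
        # On-the-wire tampering is indistinguishable from a lost message,
        # so it must never be escalated to an MPC cheating error.
        raise TransportError("message not received")
    return plaintext                     # only authenticated bytes reach MPC
```

With this separation, only a peer that passed channel authentication can ever trigger the abort handling described above.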

Total-Abort Error Handling and System Design

As we have described above, MPC protocols with total aborts require special error handling. In particular, once an error of the prescribed type has been detected, all current executions must be halted and no executions may be run in the future. This in itself has challenges. In applications where many executions with the same key are run independently (say, where a key is split between independent pairs of servers in order to increase throughput), halting all concurrent executions can be very difficult.** In contrast, in a Blockchain wallet, it is relatively easy to halt concurrent executions since they all take place from the same wallet (there may be a clone, but there would not be many of them). However, there are still other challenges. For example, consider a server that detects that a wallet has been cheating. Would the service provider running the server close that user's account without any recourse? Of course not, that would be absurd! Rather, we would understand that this case most likely means that the user's mobile device or laptop was infected by malware. As such, we would temporarily halt transactions while actively helping them to ensure that they are now clean. After that, we would reactivate their account and allow them to resume carrying out transactions. But what happens if they are infected again, and again, and again? Each time a bit would be revealed, and at some point (maybe after 10 or 20 times) we would need to transfer all of their funds on-chain (at which point new keys are used, and so any previously learned bits are rendered meaningless).

As is clearly seen, dealing with these types of errors is very dependent on the application. We will therefore proceed by explaining the exact error type in the two-party ECDSA protocol of [Lindell17], and how we designed WaaS to deal with it. The protocol of [Lindell17] is for two parties – let's call them Alice and Bob – and works with Alice sending the first message to Bob, Bob then replying, Alice sending a second message, and finally Bob sending the last message to Alice, who obtains the signature. In this last message from Bob to Alice, Alice needs to verify that the resulting signature is correct, and if not, she needs to abort this execution and all other executions (since Bob has certainly cheated, and a bit of information may be leaked to Bob simply by his knowing whether or not Alice aborted). We stress that only Bob is able to carry out this attack, and any error caused by Alice cheating can only result in what we called a "simple abort" above; in particular, no special error handling is needed.

The main question to be addressed when deploying this protocol in a wallet application with a server and user device, and in particular for WaaS, is who plays Alice and who plays Bob. Our design choice was for Alice to be the user device and for Bob to be the Coinbase server. There were a number of reasons for this. First, the chance of a user device being compromised (e.g., infected with malware) is far greater than the chance that the Coinbase infrastructure is compromised to the point that malicious code is being run. This means that the chances that we will have to deal with total-abort error handling at all are very low. Second, it is much easier to halt all executions of a user wallet on the wallet side, since it is a single operation running on the user's device (and even if there are some clones, there are not too many). In contrast, on the server side, there may be multiple instances running concurrently on different virtual machines, and a sophisticated client attacker may be able to initiate many concurrent sessions on purpose for its attack. Third, in the case of a detected error, with our deployment choice, the user detects the error and will be issued a clear error message that the signing failed and the wallet is locked (at least, that was the design intention; see below). This makes any attempted attack from inside Coinbase (e.g., if it were ever so compromised) very noisy, and so it would very quickly be publicly noticed. It is of course possible to also issue emergency errors within Coinbase if the Coinbase server were Alice, but this depends on systems and processes external to WaaS, and it is preferable to have a design with as little external dependence as possible.

As it turns out, we ourselves made a mistake in our error handling in this case. As pointed out by the researchers at Fireblocks, our client (who plays Alice) indeed issues an error. However, that error did not actually result in the WaaS client locking and refusing to participate in future executions (one could rely on the higher-level application implementing this lock, since WaaS is an SDK and not a full app; however, security measures are better enforced at the level of the WaaS library). Fortunately, due to our deployment strategy described above, this did not lead to a practically exploitable attack. The reason for this is that only the client can issue a signing request (not the server), and the user would need to authenticate from scratch each time, including biometric authentication to unlock the key share on their device. This means that the only way an attack could be carried out, if Coinbase were completely compromised, would be for a user to attempt to sign, fail in their attempt, and repeat this over 100 times. This would be like attempting to log in with your password and trying over 100 times before giving up and contacting support – a case which is just not realistic. Beyond making the attack not practically exploitable, it also means that any attempt to carry out the attack would be very noisy, as described above, and so would be detected very quickly and dealt with. The fix was very simple, and involved locking the client from future executions in the case that the error in question was received. This suffices for completely removing any risk, as the error is now dealt with as a total abort.
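The shape of the fix can be sketched as follows. The class and method names are illustrative, not the actual WaaS SDK API; the point is that the lock is enforced inside the cryptographic library itself rather than delegated to the higher-level application:

```python
class ClientLockedError(Exception):
    """Raised once the total-abort condition has been triggered."""

class TwoPartySigner:
    """Illustrative client-side (Alice) wrapper for the final protocol step."""
    def __init__(self):
        self.locked = False

    def finish_sign(self, signature_valid: bool) -> str:
        if self.locked:
            # Refuse all future executions, as a total abort requires.
            raise ClientLockedError("client locked after total-abort error")
        if not signature_valid:
            self.locked = True           # enforce the lock in the library,
            raise ClientLockedError()    # not in the higher-level application
        return "signature accepted"
```

Persisting the `locked` flag across restarts (e.g., alongside the key share) would be needed in practice; this sketch only shows the in-memory logic.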

We stress that had we deployed the protocol in the reverse direction, with Alice being played by the server, then this would have opened the door to a malicious client completely extracting the secret key. The reason why that would be possible is that the server typically responds to any client request after authentication, and so a malicious client could issue a couple of hundred signing requests, carrying out an attack that leaks a single bit each time.

It is worth asking how this oversight occurred. We knew what needed to be done, and had written it explicitly in the paper itself. Our retrospective study found that we had a mismatch in expectations at different levels of the code. The cryptography team (and independent security reviewers) oversaw the MPC engine, carrying out a review of the code and verifying that it matches the theoretical paper and specification. Indeed, upon inspecting the code, it is possible to clearly see that an error is issued. We therefore assumed that this meant that all executions would be halted. We learned some very important lessons here. First, security-related error handling should be handled as much as possible within the cryptographic code. Second, where this is not possible, it is crucial for the cryptographic review to also include these other parts of the code. Finally, the cryptographic specifications need to be extremely explicit about which type of error is being issued and how it should be handled. This wasn't done in the specification in this case. Taking a broader view, we have now instituted a process where the cryptography team works closely with engineering in order to understand the execution environment in its entirety, so as to ensure that there is no mismatch in expectations overall. In this case, we were fortunate that the flaw wasn't practically exploitable.

Total Versus Simple Aborts

Based on the above, it is clear that MPC protocols with simple aborts are preferable since they require less delicate handling. However, as long as the application is able to take care of total-abort error handling, there can be advantages. In particular, the two-party ECDSA protocol of [Lindell17] has many advantages over other protocols in terms of its very low bandwidth (and efficient computation), as well as its relative simplicity (which is itself an important goal). The choice of which protocol to use therefore very much depends on the needs of the application, and its ability to carry out proper error handling.

MPC in WaaS

For more information about the use of MPC in WaaS and its design, please see our Cryptography and MPC in Coinbase Wallet as a Service white paper.

* It’s a bit more subtle than this, and there’s also a question of fairness that relates to the question as to whether it’s possible for a corrupted party to obtain output while others don’t. For the sake of this post, we focus on the setting where some parties may abort while others may receive output, as is the standard setting for the case where an honest majority is not guaranteed.

** Of course, it isn’t actually essential to halt after only one error. Since each error can leak one bit of information about the key, it is possible to bound the amount of information leaked over multiple executions. Thus, if an application is such that at most 10 independent executions can be run, then the number of leaked bits is bounded by 10, which clearly isn’t sufficient to learn anything.


About Yehuda Lindell

Yehuda Lindell leads the cryptography team at Coinbase and is a professor of Computer Science at Bar-Ilan University (on leave). At Coinbase, Yehuda is responsible for the company’s cryptography design and its strategy around secure multiparty computation (MPC). Yehuda obtained his PhD from the Weizmann Institute of Science in 2002 and spent two years at the IBM T.J. Watson research lab as a postdoctoral fellow in the cryptography research group. Yehuda has carried out extensive research in cryptography, published over 100 scientific articles, and co-authored one of the most widely used textbooks on modern cryptography. Prior to joining Coinbase, Yehuda was the co-founder and CEO of Unbound Security, a company that provided key management and protection solutions based on MPC. Unbound was acquired by Coinbase at the end of 2021.