When less is more: the security of a 99% uptime guarantee
January 9, 2024
By Brian Robbins, Engineering Leader
Why two 9’s are optimal for blockchain validation
When it comes to operating validators, there is much confusion around what qualifies as “good”, particularly on the topics of availability or uptime. In part, this is due to comparisons to a similar experience in traditional web2 internet services. In such an environment, 100% uptime is clearly preferred — it guarantees that whenever a request comes in, the service is able to respond to that request. Significant engineering effort and design has gone into this problem over the past couple of decades to bring critical web2 services close to 100% uptime.
Blockchains themselves have generally solved this problem by implementing systems that require a majority, but not the totality, of validators to be functioning at any given time. Almost all Proof of Stake (PoS) blockchains reward validators for being online (often referred to as liveness) through the rewards generated for producing blocks and/or confirming the validity of blocks produced by others. Many chains also have penalties for not performing those duties (often called inactivity).
As such, it stands to reason that maximizing uptime is the best way to maximize rewards. However, downtime is something that happens for myriad reasons, many of which are an expected part of validator operations. Blockchains account for this by having minimal or non-existent penalties for short periods of downtime.
So what is a good uptime for validators? Is it 95%? 99%? 99.9%?
In this post, we explain why two 9’s are better than three. Or, to put it another way, why you should be wary of any uptime guarantee greater than 99%.
The myth of 99.9% uptime
If a blockchain rewards liveness and penalizes inactivity, the assumption can be made that in order to maximize earned rewards, a validator should operate with as much uptime as possible. Taking a web2 approach would lead to a path of load balancers and machine redundancy to ensure that when a validator goes down, it can be seamlessly replaced by a backup. We believe that implementing a strategy to maximize uptime over everything else is one of the most dangerous things a validator operator could do.
The reason is a failure mode called double-signing, or equivocation. This happens when a validator (represented by a specific key) submits two signed messages attesting to different blocks at the same height in the chain. Since one of the key responsibilities of a validator is to attest to which block is valid, the blockchain has no way of identifying which block is canonical if the same validator proposes or attests to two different blocks. Many blockchains will penalize a validator for doing this by taking away (slashing) part of the staked principal, prohibiting the validator from continuing to validate (jailing), or both.
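To make the definition concrete, here is a minimal sketch of how equivocation could be detected from a stream of signed messages. The `SignedMessage` type, its field names, and the `find_equivocations` helper are hypothetical and exist only for illustration; real protocols define their own message formats and slashing conditions, which are typically more involved than this.

```python
from collections import defaultdict
from typing import NamedTuple


class SignedMessage(NamedTuple):
    """Hypothetical, simplified attestation: who signed which block at which height."""
    validator: str   # identifier for the validator key that signed
    height: int      # chain height the message refers to
    block_hash: str  # block the validator attested to


def find_equivocations(messages):
    """Return {(validator, height): [block hashes]} wherever one key
    signed more than one distinct block at the same height."""
    seen = defaultdict(set)
    for msg in messages:
        seen[(msg.validator, msg.height)].add(msg.block_hash)
    return {key: sorted(hashes) for key, hashes in seen.items() if len(hashes) > 1}


messages = [
    SignedMessage("val-A", 100, "0xaaa"),
    SignedMessage("val-B", 100, "0xaaa"),
    SignedMessage("val-A", 100, "0xbbb"),  # same key, same height, different block
    SignedMessage("val-A", 101, "0xccc"),
]
equivocations = find_equivocations(messages)
print(equivocations)  # {('val-A', 100): ['0xaaa', '0xbbb']}
```

Note that val-A signing at height 101 is not a problem; only the two conflicting messages at height 100 are flagged.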
When an operator implements live failover or backups, it means they have had to activate the same key, representing a single validator, on multiple hosts at the same time. Then, if something happens to the primary validator, the system can automatically switch over to the secondary or backup validator without any downtime or inactivity.
However, if those failover systems have an issue, it’s possible to have both the primary and the secondary validator operating at the same time, attesting to different blocks, and introducing equivocation into the blockchain. At the time of writing there have been 406 validators slashed on the Ethereum blockchain, and most of those have been a result of some form of backup/failover strategy gone awry. The single best way to prevent this from happening is to ensure that no two machines can ever have the same validator keys active simultaneously. That constraint makes it virtually impossible to operate at 99.9% uptime.
A heritage of security
Fundamentally, here at Coinbase we value security over liveness for our staking operations. This means that when we need to choose between the two, we will take the path that guarantees security of funds over those that optimize for uptime. This has manifested as myriad safeguards to protect against double-signing, the most important of which is the guarantee that validator keys can only be loaded by one, and no more than one, machine at any given time.
There are many implications to this choice. Prioritizing security means increases in downtime across a number of scenarios:
We must take a validator fully offline before we can turn on its replacement.
When something goes wrong on a given validator, we will have increased downtime to either address the issue in place, or stand up a replacement after completely shutting the original down.
When a validator needs to be upgraded, there will be downtime. We do not employ methods such as hot swapping validator keys for upgrades because of the inherent increased risk of slashing.
Nevertheless, we strongly and unequivocally believe that this is the right choice for our clients and their end users. We do this to protect client funds, and further reduce the risk of slashing.
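The "one machine per key, stop before start" rule described above can be sketched as follows. This is an illustrative toy, not a description of how Coinbase implements it: `ValidatorKeyRegistry`, `KeyAlreadyActive`, and the host names are all hypothetical, and a production system would need a strongly consistent store plus confirmation that the old process has actually stopped signing.

```python
class KeyAlreadyActive(Exception):
    """Raised when a second host tries to load a key that is still active elsewhere."""


class ValidatorKeyRegistry:
    """Hypothetical in-memory record of which host currently has each key loaded."""

    def __init__(self):
        self._active = {}  # key_id -> host currently holding the key

    def holder(self, key_id):
        return self._active.get(key_id)

    def activate(self, key_id, host):
        current = self._active.get(key_id)
        if current is not None and current != host:
            # Refuse rather than risk two hosts signing with the same key.
            raise KeyAlreadyActive(f"{key_id} is still active on {current}")
        self._active[key_id] = host

    def deactivate(self, key_id, host):
        # Only the current holder can release the key.
        if self._active.get(key_id) == host:
            del self._active[key_id]

    def replace(self, key_id, old_host, new_host):
        # Stop first, then start: the gap between these two calls is the
        # downtime accepted in exchange for never running both hosts at once.
        self.deactivate(key_id, old_host)
        self.activate(key_id, new_host)


registry = ValidatorKeyRegistry()
registry.activate("validator-1", "host-a")
try:
    registry.activate("validator-1", "host-b")  # hot failover attempt
except KeyAlreadyActive as err:
    print("refused:", err)
registry.replace("validator-1", "host-a", "host-b")
print(registry.holder("validator-1"))  # host-b
```

The design choice worth noting is that the failure mode is a refusal to start (downtime) rather than a second active signer (slashing risk).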
We also care deeply about keeping our validators and the underlying OS up to date with the latest security patches and fixes. Patching quickly is one of the best ways to reduce exposure to emerging bugs and exploits. Applying these patches and updates to the validator client software we run and the underlying OS images is another place where we make the tradeoff of security over liveness. Validators that forgo the downtime necessary to process upgrades, patch software, etc. are less trustworthy and more prone to events that incur slashing penalties, such as double-signing.
Given this approach, we firmly believe that the maximum uptime any trustworthy provider should guarantee for a validator is ~99%. 99% uptime allows for roughly 7 hours and 14 minutes of downtime every month. While that may sound like a long time, we find that most validators take 15-30 minutes from launch until they are caught up and participating in the network. As such, just three software updates in a month will result in 45-90 minutes of downtime. In contrast, a 99.9% uptime guarantee would allow only about 43 minutes of downtime per month, an unrealistically short period given the numerous updates and patches necessary on a monthly basis to safely run validators for any protocol. Nor does this account for network stability issues: blockchains often have liveness and finality failures, as well as instability from ongoing upgrades.
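The downtime arithmetic above can be reproduced with a short calculation. This sketch assumes a 30-day month, so its results differ by a few minutes from the figures quoted above, which depend on the month length used. The function names are illustrative only.

```python
from fractions import Fraction


def downtime_budget_minutes(uptime_pct: str, days: int = 30) -> Fraction:
    """Minutes of downtime per month allowed by an uptime guarantee.

    uptime_pct is passed as a string (e.g. "99.9") so it converts exactly.
    Assumes a month of `days` days; 30 is an assumption, not a standard.
    """
    allowed_fraction = 1 - Fraction(uptime_pct) / 100
    return allowed_fraction * days * 24 * 60


def restarts_within_budget(uptime_pct: str, restart_minutes: int, days: int = 30) -> int:
    """How many restarts of a given length fit inside the monthly budget."""
    return int(downtime_budget_minutes(uptime_pct, days) // restart_minutes)


for pct in ("99", "99.9"):
    budget = downtime_budget_minutes(pct)
    slow = restarts_within_budget(pct, 30)  # each restart costs 30 minutes
    fast = restarts_within_budget(pct, 15)  # each restart costs 15 minutes
    print(f"{pct}% uptime: {float(budget):.1f} min/month, room for {slow}-{fast} restarts")
```

Under these assumptions, a 99% guarantee leaves 432 minutes a month (14 to 28 restarts at 15-30 minutes each), while 99.9% leaves 43.2 minutes (1 to 2 restarts), which is less than the three updates in the example above would consume.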
The only way an operator can guarantee uptime of 99.9%+ for a validator is to reduce the number of times they restart the validator (most likely by delaying security updates and patches), or to reduce the downtime when they do patch (most likely through some sort of hot swapping backup/failover strategy with duplicate validator keys active at the same time). Both of these approaches introduce unacceptable risk that puts the validator, and the funds delegated to it, in jeopardy.
Conclusion
At the end of the day, there is no magic uptime percentage. While rewards are important, and reflect validators effectively performing their duties, what matters most to stakers is the safekeeping of their assets. That’s why we will continue to prioritize the safety of client assets and the security of the underlying network even at the expense of small reductions in rewards earned.
Visit our website to learn more or contact our sales team to begin staking your assets with Coinbase Cloud.
Disclaimer
This document and the information contained herein is not a recommendation or endorsement of any digital asset, protocol, network, or project. However, Coinbase may have, or may in the future have, a significant financial interest in, and may receive compensation for services related to one or more of the digital assets, protocols, networks, entities, projects, and/or ventures discussed herein. The risk of loss in cryptocurrency, including staking, can be substantial and nothing herein is intended to be a guarantee against the possibility of loss. Reward rates listed herein are estimates, are not guaranteed and are set by the protocol and remain subject to change. Actual rate of rewards earned may vary significantly and may be zero. This document and the content contained herein are based on information which is believed to be reliable and has been obtained from sources believed to be reliable, but Coinbase makes no representation or warranty, express, or implied, as to the fairness, accuracy, adequacy, reasonableness, or completeness of such information, and, without limiting the foregoing or anything else in this disclaimer, all information provided herein is subject to modification by the underlying protocol network. Any use of Coinbase’s services may be contingent on completion of Coinbase’s onboarding process and is at Coinbase’s sole discretion, including entrance into applicable legal documentation and will be, at all times, subject to and governed by Coinbase’s policies, including without limitation, its terms of service and privacy policy, as may be amended from time to time.