eth2 insights: validator effectiveness

January 26, 2022

In our eth2 Insights Series, Elias Simos discusses the parameters governing validator effectiveness in eth2, and how validators were distributed along those parameters in Medalla.


Our eth2 Insights Series, presented by Elias Simos, Protocol Specialist at Coinbase Cloud, continues with some insights he uncovered as part of his joint research with Sid Shekhar, Blockchain Research Lead at Coinbase: eth2 Medalla — A journey through the underbelly of eth2’s final call.

In this post, Elias discusses the different parameters that govern validator effectiveness in eth2, proposes a method of ranking validators, and examines how validators were distributed along those parameters in Medalla.

Key highlights and findings

  • This post introduces a set of optimizations on the approach sketched out in eth2data.github.io: (i) normalizing the proposer and attester effectiveness scores, and (ii) aggregating the two into a “master” score of validator effectiveness.

  • Overall, validator effectiveness as defined here appears to be an excellent predictor of ETH rewards potential in eth2.

  • We expect that the distribution of validator effectiveness and rewards earned in Mainnet will vary greatly.

  • The available client-specific data that we collected was not sufficient to enable strong conclusions on how client choice can affect performance.

  • What is clear, however, is that client choice has only so much to do with validator performance. Given the large variability within identified client groups, performance seems to be largely determined by the strength of the operator’s design choices.

Introduction

Validators are the key consensus party in eth2. They are required to perform a range of useful tasks — and if they perform their duties correctly and add value to the network, they earn rewards. 

Validators in eth2 are rewarded for proposing blocks, making attestations, and whistleblowing on protocol violations. (In Phase 0, things are a little simpler, as whistleblower rewards are folded into the proposer function.)

While the rewards for block proposal far exceed those of attesters per unit, because every active validator is called upon to attest once per epoch, the bulk of the predictable rewards in Phase 0 will come from attestations — at a ratio of approximately 7:1 against proposer rewards.

Taking into account the protocol rules that govern rewards, it quickly becomes apparent that not all validators are made equal — and so the distribution of rewards across validators will not be uniform.

It follows that, if validators are interested in optimizing for rewards — which is what an economically rational actor would do — there is merit to thinking of their existence in eth2 in terms of their “effectiveness.”

Proposer effectiveness

In the grand scheme of things, being a proposer in eth2 is like a “party round” for validators. The protocol picks 32 block proposers in every epoch and tasks them with committing attestations (and, later, transaction data) on-chain and working toward finalizing a block.

The probability of becoming a proposer in Phase 0, all things being equal, is then 32/n, where n is the total number of active validators on the network. As the number of network participants grows, the probability of proposing a block diminishes — and so it did in Medalla.
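To make that dynamic concrete, here is a minimal Python sketch of the selection probability, using a hypothetical series of active-validator counts rather than actual Medalla figures:

def proposer_probability(active_validators: int, slots_per_epoch: int = 32) -> float:
    # One proposer per slot; selection is assumed uniform across all active validators.
    return slots_per_epoch / active_validators

# Hypothetical growth of the validator set (not actual Medalla data).
for n in [20_000, 40_000, 60_000, 80_000]:
    print(f"{n} active validators -> {proposer_probability(n):.4%} chance per epoch")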


Figure 1: probability of proposing a block in every epoch in Medalla

In Phase 0, a proposer is rewarded by the protocol for (i) including pending attestations collected from the P2P network, and (ii) including proposer and attester slashing proofs. The more valuable the attestations the proposer includes, the higher their reward.

As a result, there are two key values through which we can score proposers on an effectiveness spectrum: first, how often they actually propose a block when they are called upon to do so; and second, how many valuable attestations they manage to include in the blocks they propose.

Of these two, the one a proposer can more directly control is how many blocks they actually propose out of the slots the protocol allocates to them: a function of their overall uptime.

The number of valuable attestations a proposer includes on a proposed block depends not only on the proposer, but also on attesters sending their votes in on time and aggregators aggregating valuable attestations for the proposer to include.

A coarse way, then, to score for proposer effectiveness is a simple ratio of proposed_slots : total_slots_attributed.

However, given that validators enter the network at different points in time, it is wise to control for the periods when each validator has been active (measured in epochs), as well as the diminishing probability of being selected as a proposer as the total activated validators increases.

In eth2data.github.io, we introduced a further optimization to the ratio to capture the difficulty factor — by dividing the time-weighted ratio of proposed_slots : total_slots_attributed by the probability of proposing at least once, given a validator’s activation epoch in the testnet. This is the complement of the probability that they were allocated 0 slots over the n epochs they have been active, such that:

P(p≥1) = 1 − P(p0)^n

Which, plotted over ~14,700 epochs, looks like this:


Figure 2: Probability of proposing a block at least once in Medalla, over activation epochs

With this in mind we defined proposer effectiveness as:

n = epochs active

P_ratio = proposed_slots / total_slots_attributed

P_time_weight = 1 / epochs active

P_prob_weight = 1 − P(0_proposals)^n

P_effectiveness = [P_ratio * P_time_weight] / P_prob_weight
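Expressed as code, a minimal Python sketch of the raw P_effectiveness calculation might look like the following. The variable names are illustrative, and the single per-epoch no-proposal probability is a simplification — in practice P(0_proposals) changes every epoch as the validator set grows:

def proposer_effectiveness(proposed_slots: int,
                           total_slots_attributed: int,
                           epochs_active: int,
                           p_no_proposal_per_epoch: float) -> float:
    # Raw (pre-normalization) proposer effectiveness, following the formula above.
    if total_slots_attributed == 0 or epochs_active == 0:
        return 0.0
    p_ratio = proposed_slots / total_slots_attributed
    p_time_weight = 1 / epochs_active
    # Complement of the probability of zero proposals over all active epochs.
    p_prob_weight = 1 - p_no_proposal_per_epoch ** epochs_active
    return (p_ratio * p_time_weight) / p_prob_weight

# e.g. a validator active for 5,000 epochs that proposed 9 of its 10 allotted slots,
# with an assumed average per-epoch probability of not being selected of ~0.9993.
print(proposer_effectiveness(9, 10, 5_000, 0.9993))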

Given that the probability of proposing at least once is much lower for validators activated in epochs closer to the present, dividing by P_prob_weight disproportionately boosts the overall score of those recently activated proposers.

A further set of optimizations to the P_effectiveness score, introduced here to counter the distortions described above, are (see the sketch after this list):

  • normalizing the set of scores and giving them a percentile score out of 100

  • excluding the last 2,000 epochs from the overall calculation
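The percentile normalization step can be sketched roughly as below; the pandas usage and column names are illustrative, and the raw scores are assumed to already exclude the last 2,000 epochs per the cutoff above.

import pandas as pd

def normalize_scores(raw_scores: pd.Series) -> pd.Series:
    # Percentile-rank raw P_effectiveness values onto a 0-100 scale.
    return raw_scores.rank(pct=True) * 100

# Illustrative usage (hypothetical column names), one row per validator:
# df["p_effectiveness_pct"] = normalize_scores(df["p_effectiveness_raw"])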

This is what the distribution of ~70k validators that we tracked in Medalla looks like when scored for P_effectiveness.


Figure 3: Distribution of proposer effectiveness scores in Medalla

At the edges of the distribution, we observe a large concentration of really ineffective proposers (10% of the whole) that have missed virtually every opportunity they had to propose a block, and very effective proposers (5% of the whole) that have proposed in every slot they were allotted. Between those we find an exceedingly large representation of proposers at the 40% level of P_effectiveness. The remainder approximately follows a normal distribution with a mode of 0.35.

So far so good, but proposer effectiveness is only ~⅛ of the story...

Attester effectiveness

Broadly, in eth2 the two key variables for attesters to optimize around are (i) getting valuable attestations included on-chain (a function of uptime), and (ii) the inclusion delay. To optimize for attestation rewards, the attester must always participate in consensus — and must do so swiftly when their time comes.

“Rewards emission in eth2 is designed in such a way that the later an attestation gets included, the lesser the rewards that are emitted back to the attester”

For example, with 1 being the minimum possible delay, an inclusion delay of 2 slots leads to the attester receiving only half of the maximum reward available. The expected rewards “moderator” distribution is presented below.


Figure 4: inclusion distance vs the corresponding moderator (%) of the max attester reward
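In code, the moderator amounts to a simple reciprocal — a sketch assuming the 1/delay scaling described above:

def reward_moderator(inclusion_delay: int) -> float:
    # Fraction of the maximum attester reward, assuming rewards scale with 1/delay
    # (a delay of 1 slot is the minimum possible and earns the full reward).
    return 1 / inclusion_delay

for delay in [1, 2, 4, 8, 16, 32]:
    print(f"inclusion delay {delay:>2} slots -> {reward_moderator(delay):.1%} of the max reward")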

Given the above, the main categories we focused our attention on when scoring for attester effectiveness were:

  • aggregate inclusion delay — measured as the average of inclusion delays a validator has been subject to. A score of 1 means that the validator always got their attestations in without a delay, and maximized their rewards potential.

  • uptime ratio — measured as the number of valuable attestations against the time a validator has been active (in epochs). A ratio of 1 implies that the validator has been responsive in every round they have been called to attest in.

We defined the attester effectiveness score as:

A_effectiveness = uptime ratio * 100 / aggregate inclusion delay

To improve generalizability, we normalize the A_effectiveness score here so that it tops out at 100%.
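A minimal Python sketch of the attester score follows, with the caveat that the names and inputs are illustrative:

def attester_effectiveness(valuable_attestations: int,
                           epochs_active: int,
                           avg_inclusion_delay: float) -> float:
    # Uptime ratio scaled to 100 and penalized by the aggregate (average) inclusion delay.
    if epochs_active == 0 or avg_inclusion_delay == 0:
        return 0.0
    uptime_ratio = valuable_attestations / epochs_active
    return uptime_ratio * 100 / avg_inclusion_delay

# Perfect uptime with no delay scores 100; the same uptime with an
# average delay of 2 slots scores 50.
print(attester_effectiveness(14_000, 14_000, 1.0))  # 100.0
print(attester_effectiveness(14_000, 14_000, 2.0))  # 50.0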


Figure 5: distribution of attester effectiveness across 75,000 validator indices in Medalla

Here we find that 12% of the active validators in the first 14,500 epochs of Medalla scored below 10% in A_effectiveness — a consequence of the fact that Medalla is a testnet with no real value at stake, probably exaggerated by the roughtime incident.

Validator effectiveness

The final step for arriving at a comprehensive validator effectiveness score is to combine the attester and proposer effectiveness scores into one “master” score.

To simplify the process we define the master validator effectiveness score as:

V_effectiveness = A_effectiveness * ⅞ + P_effectiveness * ⅛

V_effectiveness here reflects the ecosystem-wide expected distribution of ETH rewards that validators stand to achieve by performing their attester and proposer duties.
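As a one-line sketch, using the weights defined above:

def validator_effectiveness(a_effectiveness: float, p_effectiveness: float) -> float:
    # Master score: attester and proposer scores weighted by their approximate
    # share of expected Phase 0 rewards (~7:1 in favor of attestations).
    return a_effectiveness * 7 / 8 + p_effectiveness * 1 / 8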

Given the much heavier weighting of attester effectiveness in the score, the distribution of V_effectiveness ends up looking very much like that of A_effectiveness.


Figure 6: distribution of validator effectiveness across 75,000 validator indices in Medalla

To test the accuracy of this approach to calculating V_effectiveness, we plotted the score against the total rewards that the validators achieved over their lifetime in Medalla, and tested for how good a predictor of rewards V_effectiveness can be.

To ensure that the comparison is “apples-to-apples,” for the purpose of the exercise we selected only the group of validators that were active at genesis (20k unique indices).
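For readers who want to run a similar test on their own data, here is a rough sketch using statsmodels; the dataframe and column names are assumptions for illustration, not our actual pipeline:

import pandas as pd
import statsmodels.api as sm

def rewards_vs_effectiveness(df: pd.DataFrame):
    # Regress lifetime ETH rewards on the validator effectiveness score and
    # report the simple correlation between the two (hypothetical column names).
    X = sm.add_constant(df["v_effectiveness"])
    model = sm.OLS(df["total_rewards_eth"], X).fit()
    corr = df["v_effectiveness"].corr(df["total_rewards_eth"])
    return model, corr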


Figure 7: validator effectiveness score vs ETH rewards achieved over the Medalla lifecycle — for validators that were active since genesis


Figure 8: OLS regression results table of ETH rewards (dependent variable) vs validator effectiveness score (independent variable)

“The result was an astounding 86.5% correlation between the two variables — meaning that the validator effectiveness score is an excellent predictor of the ETH rewards a validator stands to achieve in eth2”

When broadening the correlations matrix to capture a wider range of variables of interest, we find that the correlation between V_effectiveness and rewards drops to 50% — probably because the different entry points and exposure to varying network conditions become moderating factors.


Figure 9: correlations between key validator effectiveness variables

A few more observations worth noting:

  • Uptime is a lot more closely correlated to V_effectiveness than is the inclusion delay — and is thus likely what operators should optimize for first.

  • V_effectiveness improved significantly as Medalla matured, precisely because the aggregate inclusion delay improved greatly — as demonstrated by the 70% positive correlation between the delay and epochs_active.

  • As for the attestations surplus introduced on-chain, there is only a weak relationship between a validator’s effectiveness and the amount of surplus info with which they load the chain — meaning that the top performers (from a rewards perspective) may be only marginally less net “pollutants.”

When testing for validator effectiveness and rewards in groups of validators aggregated by their most common denominator (a mix of graffiti, self id name, common withdrawal keys, and common eth1 deposit address), the results are equally strong.
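A simplified sketch of that grouping step, again with hypothetical column names:

import pandas as pd

def group_level_scores(df: pd.DataFrame) -> pd.DataFrame:
    # Aggregate per-validator scores and rewards by operator group; the
    # "operator_group" label is assumed to come from graffiti / self-id /
    # withdrawal-key / eth1-deposit-address matching.
    grouped = df.groupby("operator_group").agg(
        mean_v_effectiveness=("v_effectiveness", "mean"),
        mean_rewards_eth=("total_rewards_eth", "mean"),
        validators=("validator_index", "count"),
    )
    return grouped.sort_values("mean_v_effectiveness", ascending=False)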


Figure 10: validator effectiveness score vs ETH rewards and distribution of validator effectiveness scores grouped by operator group

The correlation between rewards and V_effectiveness stands at 80%, while in the distribution view we can discern three groups of operators that performed poorly (C), average (B), or well (A).

Zooming in on the top performers (A) and taking a client choice perspective, we were able to identify only 30% of the ~17k validator indices in the group. 93% of them identified as either Prysm or Lighthouse, with the representation in the group from Teku, Nimbus and Lodestar at under 1% of the sample. 

Given that this distribution is not representative of the population, and that the unidentified 70% recorded the highest validator effectiveness score, there is no conclusion to extract here with respect to the relationship between performance and client choice.


Figure 11: summary view of client choice of top performing operator groups

Zooming out again at the population level and segmenting the view of V_effectiveness distributions by client choice, Prysm and Lighthouse score at the top, with Teku, Lodestar and Nimbus following.


Figure 12: summary view of validator effectiveness by client choice


Figure 13: validator effectiveness score distributions by client choice — population view

“This is a strong hint toward the fact that client choice has only so much to do with validator performance. The remainder may lean on the strength of the operator’s design choices”

What is perhaps the most robust finding here is the fact that even among the top performing clients, there seems to be a significantly wide distribution in in-group validator effectiveness — commensurate with the picture painted by the aggregate distribution of validators along their effectiveness score. This is a strong hint toward the fact that client choice has only so much to do with validator performance. The remainder likely leans on the strength of the operator’s design choices.

Concluding remarks

In this post, we developed a feature-complete validator-effectiveness methodology that is not only a strong predictor of ETH rewards, but also takes a relative approach to scoring by normalizing scores.

Given that the majority of the factors governing rewards in eth2 are relative to the state of the network, we believe that, while computationally more demanding, the approach we introduce here paints a more accurate picture compared to methodologies that take a nominal view.

We also found that validators running the Prysm and Lighthouse clients recorded better performance in Medalla, on aggregate. However, given the ever-changing macro-level conditions in the testnet — as well as the fact that segmenting for client choice by graffiti is an imperfect way to do it — there are no strong conclusions with respect to the relationship between client choice and performance.

It’s worth underscoring, however, what is crystal clear: that a large chunk of the determinants of performance lie outside client choice, and more closely relate to the robustness of the operator’s set-up.