mev-boost Relay monitoring & preventing continued relay errors

Once a proposer calls submitBlindedBlock to a relay (with a signed header), it depends on the relay to release the block to be able to propose anything (no fallback to a local block is possible at that point due to possible slashing).

There are several relay error scenarios:

1. payload withholding (the relay doesn't release the payload and the proposer has to forfeit the slot)
2. incorrect payload
   a. incorrect value (the final amount paid by the builder to the proposer differs from the amount claimed in the BuilderBid)
   b. invalid block (invalid data / fields)

Question: How can we shield proposers from faulty relays, and how can we prevent consecutive slots with errors caused by faulty relay behaviour?

A possible solution is a monitoring service run by a trusted third-party, which we can call Relay Monitor (RM).

1. Whenever mev-boost calls submitBlindedBlock on a relay, it also sends a request to the RM, including the SignedBuilderBid, the relay it originated from, and the submitBlindedBlock body.
2. The RM also requests the payload from the relay.
3. Thus the RM can check
   a. whether the payload is withheld
   b. whether the block matches the bid
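The checks above can be sketched roughly as follows. This is only an illustration, not the RM's actual logic: the `Bid` and `Payload` types are hypothetical, heavily simplified stand-ins for the real SignedBuilderBid and payload, and a block-hash mismatch stands in for full block validation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Fault(Enum):
    NONE = "none"
    PAYLOAD_WITHHELD = "payload_withheld"
    INCORRECT_VALUE = "incorrect_value"
    INVALID_BLOCK = "invalid_block"


@dataclass
class Bid:
    # Hypothetical, simplified view of a SignedBuilderBid.
    block_hash: str
    value_wei: int


@dataclass
class Payload:
    # Hypothetical, simplified view of the payload released by the relay.
    block_hash: str
    paid_to_proposer_wei: int


def check_relay(bid: Bid, payload: Optional[Payload]) -> Fault:
    """Classify a relay's behaviour for one slot."""
    if payload is None:
        # The relay never released the payload.
        return Fault.PAYLOAD_WITHHELD
    if payload.block_hash != bid.block_hash:
        # Assumption: a hash mismatch stands in for "invalid block".
        return Fault.INVALID_BLOCK
    if payload.paid_to_proposer_wei != bid.value_wei:
        # The amount paid differs from the amount claimed in the bid.
        return Fault.INCORRECT_VALUE
    return Fault.NONE
```

A real monitor would also need to verify signatures and execute the block to confirm the payment; that is out of scope for this sketch.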

If there is any problem, the relay's scoring/reputation is updated in the RM and propagated to all connected proposers (by mev-boost polling the relay status endpoint, maybe also push as an option). If any relay behaves incorrectly, all connected proposers can ignore faulty relays for some time (reputation mechanism TBD).
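On the mev-boost side, "ignoring faulty relays for some time" could look something like the sketch below. It assumes the polled status comes back as a mapping from relay URL to a Unix timestamp until which the relay should be skipped; that response shape and the function name are assumptions, since the status format isn't defined yet.

```python
import time
from typing import Dict, List, Optional


def usable_relays(
    relays: List[str],
    ignore_until: Dict[str, float],
    now: Optional[float] = None,
) -> List[str]:
    """Return the relays that are not currently flagged by the monitor.

    ignore_until is a hypothetical status map: relay URL -> Unix time
    until which the relay should be skipped. Unlisted relays are usable.
    """
    current = time.time() if now is None else now
    return [r for r in relays if ignore_until.get(r, 0.0) <= current]
```

Relays whose ignore window has expired become usable again automatically, so the proposer doesn't need to track any state beyond the last polled status.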

This (centralised) service can be put into production quickly, and can mitigate a range of issues resulting from faulty relays. It should be run by a trusted party, and could be replaced in the mid- to longer term with a more decentralized/trustless solution.

TBD:

Reputation mechanics: what exactly happens on a single instance of any of the errors?
Who should run a relay monitor, and how many instances are the sweet spot? There's an argument for having a small number, because (a) the more proposers connect to it, the more it knows about relay issues, and (b) it has a lot of "power" in that it can blacklist relays.

Tl;dr: A relay monitor could observe any relay problem a validator experiences, and can tell all the other connected validators about problems with a specific relay. Thus, if a relay causes a problem with one validator, all the other connected validators would immediately know, and could avoid that relay for some time (or whatever mechanic).

Jun 09 '22 11:06 metachris

Here's a rough diagram outlining the setup (src):

[Diagram: Untitled-2022-06-10-0925]

Jun 10 '22 07:06 metachris

Not sure if you saw this doc by Yoav, but I think he has a nice outline for what should be done to keep relays "honest": https://notes.ethereum.org/@yoav/BJeOQ8rI5

A few general thoughts here:

I think the RM should really be multiple actors, otherwise we're not all that much better off than with a relay. We'd just be trusting a different person.
Reputation mechanics -- I think withholding should pretty much be insta-ban. Not sure about invalid block, but also feels like a pretty bad fault.

Jun 13 '22 09:06 lightclient

https://notes.ethereum.org/@yoav/BJeOQ8rI5

I think the RM should really be multiple actors, otherwise we're not all that much better off than with a relay. We'd just be trusting a different person.

Good link, it states the problem clearly and hints at a solution based on a committee. It's not yet clear how such a committee would work.

A decentralized setup would definitely be great, and I can see that as a possible next step. It does seem to first require a bunch of work on specification, research and prototyping, to explore the consensus protocol, committee duties and repercussions for malicious behavior.

Reputation mechanics -- I think withholding should pretty much be insta-ban. Not sure about invalid block, but also feels like a pretty bad fault.

Withholding once could actually be a networking issue, I don't think that should be a permaban instantly. Maybe banning for a few hours at first would suffice, and increasing penalties for repeated offenses 🤔
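To make the idea concrete, here's a minimal sketch of an escalating ban. The numbers are placeholders I picked for illustration (2 hours base, doubling per repeated offense, capped at 7 days), not a proposal for the actual reputation mechanism.

```python
from datetime import timedelta

BASE_BAN = timedelta(hours=2)  # assumption: "a few hours" for a first offense
MAX_BAN = timedelta(days=7)    # assumption: upper bound so bans stay temporary


def ban_duration(offense_count: int) -> timedelta:
    """Ban length for the n-th offense: doubles each time, up to MAX_BAN."""
    if offense_count <= 0:
        return timedelta(0)
    return min(BASE_BAN * 2 ** (offense_count - 1), MAX_BAN)
```

With these placeholder values a first offense gives 2 hours, the second 4 hours, the third 8 hours, and so on until the cap is reached.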

Jun 13 '22 10:06 metachris

Could a relay send different responses to the RM than to mev-boost? I don't see why a relay would do this, but just a thought.

Jun 30 '22 22:06 terencechain

I don't think a relay has a reliable way to distinguish a proposer (mev-boost) from a monitor 🤔 The payload is the same, although maybe the request profile over time is different...

Jul 01 '22 07:07 metachris

I would like to see one minor change, which is that the proposer node can connect to multiple monitors (not just one), and the monitors can connect and talk to each other. While a full gossip network would be ideal, just having a hub-and-spoke (decentralized) system where clients can connect to multiple hubs is probably "good enough" to buy us time until a more complete gossip network can be set up and secured.
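As a rough illustration of the client side of this, a proposer connected to several monitors could combine their reports with a simple threshold. How reports should actually be combined is an open question; the threshold rule and the `reports` shape below are assumptions for the sake of the example.

```python
from collections import Counter
from typing import Dict, List, Set


def relays_to_ignore(reports: Dict[str, List[str]], threshold: int) -> Set[str]:
    """Return relays flagged as faulty by at least `threshold` monitors.

    reports is a hypothetical map: monitor name -> relays it flags.
    Each monitor counts at most once per relay.
    """
    counts = Counter(
        relay for flagged in reports.values() for relay in set(flagged)
    )
    return {relay for relay, n in counts.items() if n >= threshold}
```

A threshold of 1 trusts any single monitor (most cautious for the proposer, but any monitor can blacklist a relay), while a higher threshold limits the power of an individual monitor.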

Jul 28 '22 15:07 MicahZoltu

I don't think a relay has a reliable way to distinguish a proposer (mev-boost) from a monitor 🤔 The payload is the same, although maybe the request profile over time is different...

esp if monitors are "trusted third parties" then they will be well-known entities with fairly fixed IPs, relays could definitely use this to discriminate responses although I don't see how this could be gamed right now

Aug 02 '22 23:08 ralexstokes

Linking the current design doc by @ralexstokes: https://hackmd.io/@ralexstokes/SynPJN_pq

Aug 24 '22 10:08 metachris

For the SecureRpc Relay we are operating, here are some metrics/KPIs that are collected and some diagrams (out of date, but should be slightly helpful)

Metrics collected: https://gist.github.com/sambacha/d613f8be00caa50befe0c7a8e1dda073
Grafana Dashboard screen shot: screencapture-grafana-manifoldx-d-RixFH2jnz-relay-overview-copy-2022-08-26-13_02_03 (Grafana v9.1 allows public dashboards, so we should be able to make this publicly queryable soon)

Aug 28 '22 00:08 sambacha

Linking Alex's relay monitor implementation

Sep 06 '22 19:09 kailinr

Please correct me if I'm wrong, but couldn't the aforementioned issues be tackled by applying a BFT system to the proposer <-> relay interaction? Since the error scenarios stated here show that a relay can behave in a byzantine manner, a proposer could send the request to 3f+1 relay nodes and retrieve the correct answer based on consensus, avoiding the issues stated above. BFT would ensure strong consistency regarding the latest payloads and correct values, compared to the eventual consistency of a gossip network's consensus model. Furthermore, the trust model of a BFT system is stricter than that of a gossip network. On the other hand, a gossip network has less communication overhead than BFT.

In both cases, the caveat is that it would possibly increase complexity and degrade performance as a tradeoff.

PS: Please excuse my limited knowledge of BFT/gossip systems in case what I mentioned above is incorrect.

Sep 07 '23 11:09 MoeMahhouk

this is an interesting avenue for exploration; however, the prevailing model is that relays are not guaranteed to share any of their bids/data so having a BFT style approach across disparate actors doesn't really make sense...

there has been a thread we have been dancing around on the mev-boost community calls for some time that points towards a different model where independent entities do just run some kind of "relay" node and then builders are expected to publish to all of them, e.g. over some kind of gossip net -- and in this case we could imagine some kind of consensus over the "bid pool" that reduces room for byzantine behavior

that being said, the "live" pathways of the relay are incredibly latency sensitive so unless the consensus process really brought substantial benefits I think it would be hard to get adoption

and I think we'd also need to move towards more of an optimistic regime, see something like v3 here: https://github.com/michaelneuder/optimistic-relay-documentation/blob/main/towards-epbs.md

Sep 08 '23 14:09 ralexstokes
