
WIP feat: bounded consensus events channel

fgimenez opened this issue 1 year ago

Requires #8193

There is an unbounded channel for consensus events sent from the RPC.

The channel to_engine is used by the BeaconConsensusEngineHandle to handle messages from the RPC. The channel is unbounded and therefore susceptible to DoS attacks or excessive resource consumption in times of congestion.

The RPC is only callable from the consensus client, which is considered a trusted actor, so the risk of a DoS attack is low. However, during times of congestion a large number of events could build up. Processing events includes executing blocks and updating the blockchain tree, so it is possible for events to be added to the channel faster than they can be processed, causing the channel to grow without bound.

This PR replaces the unbounded channel with a bounded one, for now setting a default channel size of 1000; this should be adapted after gathering usage metrics.
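As a rough sketch of the bounded-channel behavior (using std's `sync_channel` rather than the tokio channel reth actually uses, and a hypothetical `EngineMessage` type standing in for the real message enum):

```rust
use std::sync::mpsc::sync_channel;

// Hypothetical stand-in for reth's engine message type.
#[derive(Debug, PartialEq)]
struct EngineMessage(u64);

// Returns how many of `n` messages fit into a channel bounded at `cap`
// before the receiver drains anything.
fn fill_bounded(cap: usize, n: u64) -> u64 {
    let (tx, _rx) = sync_channel::<EngineMessage>(cap);
    let mut sent = 0;
    for i in 0..n {
        match tx.try_send(EngineMessage(i)) {
            Ok(()) => sent += 1,
            Err(_) => break, // channel full (or disconnected): stop sending
        }
    }
    sent
}

fn main() {
    // With a bound of 1000 (the proposed default), excess messages are
    // rejected instead of growing the queue without limit.
    assert_eq!(fill_bounded(1000, 1500), 1000);
}
```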

fgimenez avatar May 14 '24 11:05 fgimenez

indeed, dropping engine messages in congestion conflicts with this issue about not dropping engine messages https://github.com/paradigmxyz/reth/issues/8203

emhane avatar May 18 '24 13:05 emhane

Hmm, this is a tough one; neither option seems great. To add some further pain to the issue, the Engine API specs do not prevent replay attacks, which would be a pain here. Note the timestamp needs to be within 1 minute of the current clock, so replays are only valid that long.

The authentication scheme is not designed to

  • Prevent attackers with capability to read ('sniff') network traffic from reading the traffic,
  • Prevent attackers with capability to read ('sniff') network traffic from performing replay-attacks of earlier messages.

I don't imagine this is a common case, but someone with read access to the EL - CL connection could replay messages. An example of a malicious case would be two FCU updates which switch the head from chain A to chain B. If the attacker spammed the connection with FCU A - FCU B - FCU A - FCU B - ... it would cause quite a lot of processing. Now if we have a bounded channel, they'd be able to use this replay to cause genuine messages to be dropped, which is just as bad as a resource DoS.

In terms of dropping excess FCUs: if the most recent FCU has a valid canonical head, then all of the older FCUs can be safely dropped. However, if an FCU does not contain the canonical head, then older FCUs are relevant as they could update the canonical head. So there's potential here to minimize the number of FCUs we process by filtering out some old ones.
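That filtering rule could be sketched like this; `Fcu`, the numeric head hashes, and the `head_is_known` predicate are hypothetical, simplified stand-ins for the real types:

```rust
// Hypothetical, simplified forkchoice-update message: just the head hash.
#[derive(Clone, Debug, PartialEq)]
struct Fcu {
    head: u64,
}

// If the newest FCU's head is already known/valid, only it matters and all
// older FCUs can be dropped; otherwise every FCU must be kept, because an
// older one might still advance the canonical head.
fn filter_fcus(queue: Vec<Fcu>, head_is_known: impl Fn(&Fcu) -> bool) -> Vec<Fcu> {
    match queue.last() {
        Some(newest) if head_is_known(newest) => vec![newest.clone()],
        _ => queue,
    }
}

fn main() {
    let q = vec![Fcu { head: 1 }, Fcu { head: 2 }, Fcu { head: 3 }];
    // Newest head known: older entries are safe to drop.
    assert_eq!(filter_fcus(q.clone(), |f| f.head == 3), vec![Fcu { head: 3 }]);
    // Newest head unknown: keep everything.
    assert_eq!(filter_fcus(q.clone(), |_| false).len(), 3);
}
```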

In terms of dropping new payloads, it would not be good to drop any canonical blocks, as these would need to be fetched again and processed later. So I believe we'd need to process all of the new payloads unless some logic is implemented to determine which are not canonical.

tl;dr: so maybe a tidier solution here is to not bound the channel but to drain all of the messages into a queue. In the queue, some management can occur:

  • detecting and combining duplicate FCUs or new payloads
  • some optimistic ordering of FCUs to process the most recent first (would need to make sure we've processed enough new payloads)
  • dropping any FCUs older than the most recent valid FCU that has been processed.
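A minimal sketch of such queue management, assuming (simplistically) that the most recent FCU supersedes older ones; `Msg` and the numeric identifiers are hypothetical:

```rust
use std::collections::HashSet;

// Hypothetical simplified engine messages: payloads identified by block
// number, forkchoice updates identified by head hash.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Msg {
    NewPayload(u64),
    Fcu(u64),
}

// Drain a batch of messages into a working queue: exact duplicates (e.g.
// replayed or retried requests) are combined, all new payloads are kept,
// and only the most recent FCU survives.
fn manage_queue(drained: Vec<Msg>) -> Vec<Msg> {
    let mut seen = HashSet::new();
    let mut payloads = Vec::new();
    let mut last_fcu = None;
    for msg in drained {
        if !seen.insert(msg.clone()) {
            continue; // duplicate
        }
        match msg {
            Msg::NewPayload(_) => payloads.push(msg),
            Msg::Fcu(_) => last_fcu = Some(msg),
        }
    }
    payloads.extend(last_fcu);
    payloads
}

fn main() {
    let drained = vec![Msg::NewPayload(1), Msg::Fcu(10), Msg::NewPayload(1), Msg::Fcu(11)];
    assert_eq!(manage_queue(drained), vec![Msg::NewPayload(1), Msg::Fcu(11)]);
}
```

As the discussion above notes, dropping older FCUs is only safe when the newest FCU's head is known to be valid, so a real implementation would need that check before collapsing to a single FCU.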

kirk-baird avatar May 20 '24 01:05 kirk-baird

how about just upgrading the connection to CL to https @kirk-baird ?

emhane avatar May 21 '24 16:05 emhane

how about just upgrading the connection to CL to https @kirk-baird ?

Yep HTTPS has replay protection so that would prevent re-using old requests.

kirk-baird avatar May 22 '24 01:05 kirk-baird

how about just upgrading the connection to CL to https @kirk-baird ?

Yep HTTPS has replay protection so that would prevent re-using old requests.

would it require a fix on CL side to support this as an extra security feature or is it already supported? @kirk-baird

emhane avatar May 22 '24 12:05 emhane

The engine api connection requires a valid jwt, how feasible is something like this? I assume an attacker on the same machine is out of scope, would it need to feed the right messages to the CL s.t. it sends these messages?

Rjected avatar May 22 '24 14:05 Rjected

sorry clicked the wrong button...

Rjected avatar May 22 '24 14:05 Rjected

oops, this was pointing to #8193's branch and was automatically closed when it was merged and the branch removed; reopening

fgimenez avatar May 22 '24 18:05 fgimenez

The engine api connection requires a valid jwt, how feasible is something like this? I assume an attacker on the same machine is out of scope would it need to feed the right messages to the CL s.t. it sends these messages?

right, the risk is very low. However, with this fix, there is now a risk that the node can't recover during network congestion due to dropped engine messages @Rjected

emhane avatar May 22 '24 20:05 emhane

The engine api connection requires a valid jwt, how feasible is something like this? I assume an attacker on the same machine is out of scope, would it need to feed the right messages to the CL s.t. it sends these messages?

Good question. The issue here is that a valid JWT is replayable: if an attacker reads a message with a JWT, they can replay that message as many times as they desire until it expires (~1 minute). An attacker isn't able to create new messages, since they can't forge a JWT, only re-use existing tokens. Note that each JWT is tied to a single message via HMAC, so an attacker can't take a valid one and use it on a different message.

So the likelihood of this issue depends on how easy it is to read messages and send them again. By default these connections are not encrypted and use HTTP or WebSockets. If the CL + EL are on the same machine, then reading the connection means you have access to the machine and can do much worse attacks; we're not worried about this case.

The case to think about is when the EL + CL are on different machines and communicate over an untrusted network, e.g. regular HTTP that can be eavesdropped. Upgrading this connection to HTTPS when they're not on the same machine would resolve this issue, as it prevents replay and also encrypts the messages.

kirk-baird avatar May 23 '24 00:05 kirk-baird

Setting the channel to bounded, without doing much else, could cause similar trouble, with messages sitting in waiting tokio tasks instead of in the channel. Instead, you'd likely want to think about a solution involving try_send or send_timeout
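A sketch of why `try_send` helps: it surfaces a full channel at the call site instead of parking the sender. Here std's `sync_channel` stands in for tokio's bounded mpsc (whose `send_timeout` would instead bound how long the sender waits); `offer` and `demo` are hypothetical helper names:

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

// With try_send, a full channel surfaces backpressure immediately at the
// call site, e.g. so the RPC layer can reject the request, rather than
// parking the sending task indefinitely as a blocking send would.
fn offer(tx: &SyncSender<u32>, v: u32) -> Result<(), u32> {
    tx.try_send(v).map_err(|e| match e {
        TrySendError::Full(v) | TrySendError::Disconnected(v) => v,
    })
}

fn demo() -> (Result<(), u32>, Result<(), u32>) {
    let (tx, _rx) = sync_channel::<u32>(1); // capacity 1
    (offer(&tx, 1), offer(&tx, 2)) // first fills the slot, second is rejected
}

fn main() {
    assert_eq!(demo(), (Ok(()), Err(2)));
}
```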

In terms of correctness with the CL, the execution API spec already describes a notion of a timeout and retry. When under heavy load with an unbounded channel, Reth's eventual response could come long past the timeout and be ignored. The channel could also contain duplicate (retried) requests.

gnattishness avatar May 23 '24 05:05 gnattishness

thanks @gnattishness

engine_newPayloadV3 and engine_forkchoiceUpdatedV3 have 8 second timeouts.

how about keeping the unbounded channel in that case, but polling it at short intervals instead of only when the EL is ready to process the next engine message? We could then queue the engine messages in a type that evicts entries based on the timeout when inserting new entries, much like this type https://github.com/sigp/discv5/blob/8eac0fee8da65c4021c6bba7c976d395151309ed/src/lru_time_cache.rs#L7-L18.
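A minimal sketch of such a timeout-evicting queue, using the Engine API's 8-second timeout as the TTL; `TimeoutQueue` is a hypothetical, much-simplified analogue of discv5's `LruTimeCache`:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Entries older than `ttl` are dropped when new ones are inserted, so
// messages past their spec'd timeout are never processed.
struct TimeoutQueue<T> {
    ttl: Duration,
    entries: VecDeque<(Instant, T)>,
}

impl<T> TimeoutQueue<T> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: VecDeque::new() }
    }

    // Takes `now` explicitly to keep the sketch deterministic and testable.
    fn insert_at(&mut self, now: Instant, value: T) {
        // Evict expired entries from the front (oldest first).
        while let Some((t, _)) = self.entries.front() {
            if now.duration_since(*t) > self.ttl {
                self.entries.pop_front();
            } else {
                break;
            }
        }
        self.entries.push_back((now, value));
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}

fn demo_len() -> usize {
    // engine_newPayloadV3 / engine_forkchoiceUpdatedV3 have 8-second timeouts.
    let mut q = TimeoutQueue::new(Duration::from_secs(8));
    let t0 = Instant::now();
    q.insert_at(t0, "fcu-1");
    q.insert_at(t0 + Duration::from_secs(9), "fcu-2"); // evicts the expired fcu-1
    q.len()
}

fn main() {
    assert_eq!(demo_len(), 1);
}
```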

I think that type can replace the VecDeque mentioned in the conflicting issue https://github.com/paradigmxyz/reth/issues/8203.

emhane avatar May 25 '24 13:05 emhane

closing in favour of addressing https://github.com/paradigmxyz/reth/issues/8203 wrt the spec'd timeouts for messages

emhane avatar May 30 '24 00:05 emhane