reth
WIP feat: bounded consensus events channel
Requires #8193
There is an unbounded channel for consensus events sent from the RPC.
The `to_engine` channel is used by the `BeaconConsensusEngineHandle` to handle messages from the RPC. The channel is unbounded and therefore susceptible to DoS attacks or excessive resource consumption in times of congestion.
The RPC is only callable by the consensus client, which is considered a trusted actor, so the risk of a DoS attack is low. However, during congestion a large number of events could build up. Processing an event can involve executing blocks and updating the blockchain tree, so events may be added to the channel faster than they can be processed and the channel will keep growing.
This PR replaces the unbounded channel with a bounded one, for now with a default channel size of 1000; this should be adjusted after gathering usage metrics.
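To illustrate the shape of the change (reth uses tokio channels internally; the constant matches the default described above, everything else here is a std-only sketch, not the actual reth code):

```rust
use std::sync::mpsc::{channel, sync_channel};

/// Default capacity proposed in this PR; to be tuned after gathering metrics.
const ENGINE_CHANNEL_SIZE: usize = 1000;

fn main() {
    // Before: unbounded, so a congested engine lets messages pile up forever.
    let (_unbounded_tx, _unbounded_rx) = channel::<&str>();

    // After: bounded, so at most ENGINE_CHANNEL_SIZE messages can queue.
    let (tx, _rx) = sync_channel::<&str>(ENGINE_CHANNEL_SIZE);
    for _ in 0..ENGINE_CHANNEL_SIZE {
        tx.try_send("engine message").unwrap();
    }
    // The channel is now full; further non-blocking sends fail instead of
    // consuming ever more memory.
    assert!(tx.try_send("one more").is_err());
    println!("capacity demo ok");
}
```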
indeed, dropping engine messages in congestion conflicts with this issue about not dropping engine messages https://github.com/paradigmxyz/reth/issues/8203
Hmm, this is a tough one; it seems neither option is great. To add some further pain to the issue, the Engine API specs do not prevent replay attacks, which would be a pain here. Note the timestamp needs to be within 1 minute of the current clock, so replays are only valid for that long.
> The authentication scheme is not designed to
> - Prevent attackers with capability to read ('sniff') network traffic from reading the traffic,
> - Prevent attackers with capability to read ('sniff') network traffic from performing replay-attacks of earlier messages.
I don't imagine this is a common case, but someone with read access to the EL-CL connection could replay messages. An example of a malicious case would be two FCUs that switch the head from chain A to chain B. If the attacker spammed the connection with FCU A - FCU B - FCU A - FCU B - ... it would cause quite a lot of processing. With a bounded channel, they'd also be able to use this replay to cause genuine messages to be dropped, which is as bad as a resource DoS.
In terms of dropping excess FCUs: if the most recent FCU has a valid canonical head, then all of the older FCUs can be safely dropped. However, if an FCU does not contain the canonical head, then the older FCUs are still relevant, as they could update the canonical head. So there's potential here to minimize the number of FCUs we process by filtering out some old ones.
In terms of dropping new payloads: it would not be good to drop any canonical blocks, as these would need to be fetched again and processed later. So I believe we'd need to process all of the new payloads unless some logic is implemented to determine which are not canonical.
tl;dr: so maybe a tidier solution here is to not bound the channel and instead drain all of the messages into a queue. In the queue some management can occur:
- detecting and combining duplicate FCUs or new payloads
- some optimistic ordering of FCUs to process the most recent first (we'd need to make sure we've processed enough new payloads first)
- dropping any FCUs older than the most recent valid FCU that has been processed.
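A rough sketch of that queue management (the `Fcu` type, `manage_fcus` name, and the validity flag are all hypothetical simplifications, not reth code): keep only the latest occurrence of each FCU head, and once the newest head is known valid, drop all older entries.

```rust
use std::collections::VecDeque;

/// Hypothetical forkchoice update, reduced to just a head-block identifier.
#[derive(Clone, Debug, PartialEq)]
struct Fcu {
    head: u64, // stand-in for a block hash
}

/// Drain a batch of FCUs into a working queue:
/// - keep only the latest occurrence of each head, so replayed
///   A-B-A-B spam collapses to at most one entry per head,
/// - if the newest FCU's head is already known valid, drop all older
///   entries, since they can no longer change the canonical head.
fn manage_fcus(incoming: Vec<Fcu>, newest_head_is_valid: bool) -> VecDeque<Fcu> {
    let mut queue: VecDeque<Fcu> = VecDeque::new();
    for fcu in incoming {
        queue.retain(|q| q.head != fcu.head);
        queue.push_back(fcu);
    }
    if newest_head_is_valid {
        if let Some(newest) = queue.pop_back() {
            queue.clear();
            queue.push_back(newest);
        }
    }
    queue
}

fn main() {
    // Replayed A-B-A-B spam collapses to one entry per distinct head.
    let spam = vec![Fcu { head: 1 }, Fcu { head: 2 }, Fcu { head: 1 }, Fcu { head: 2 }];
    assert_eq!(manage_fcus(spam.clone(), false).len(), 2);

    // If the newest head is known valid, the older FCUs drop too.
    assert_eq!(manage_fcus(spam, true), VecDeque::from(vec![Fcu { head: 2 }]));
    println!("fcu queue demo ok");
}
```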
how about just upgrading the connection to the CL to HTTPS @kirk-baird ?
Yep, HTTPS has replay protection, so that would prevent re-using old requests.
would it require a fix on the CL side to support this as an extra security feature, or is it already supported? @kirk-baird
The engine API connection requires a valid JWT; how feasible is something like this? I assume an attacker on the same machine is out of scope. Would it need to feed the right messages to the CL such that it sends these messages?
sorry clicked the wrong button...
oops, this was pointing to #8193's branch and was automatically closed when it was merged and the branch was removed; reopening
> The engine api connection requires a valid jwt, how feasible is something like this? I assume an attacker on the same machine is out of scope would it need to feed the right messages to the CL s.t. it sends these messages?
right, the risk is very low. However, with this fix, the risk that the node can't recover during network congestion due to dropped engine messages comes into existence @Rjected
> The engine api connection requires a valid jwt, how feasible is something like this? I assume an attacker on the same machine is out of scope, would it need to feed the right messages to the CL s.t. it sends these messages?
Good question. The issue here is that a valid JWT is replayable. That means if an attacker reads a message with a JWT, they can replay that message as many times as they desire until it expires (~1 minute). An attacker isn't able to create new messages since they can't forge a JWT, only re-use existing tokens. Note that each JWT is tied to a single message via HMAC, so an attacker can't take a valid one and use it on a different message.
So the likelihood of this issue depends on how easy it is to read messages and send them again. By default these connections are not encrypted and use HTTP or WebSockets. If the CL and EL are on the same machine, then reading the connection means you have access to the machine and could carry out much worse attacks, so we're not worried about that case.
The case to think about is when the EL and CL are on different machines and communicate over an untrusted network, e.g. regular HTTP, which can be eavesdropped. Upgrading this connection to HTTPS when they're not on the same machine would resolve this issue, as it prevents replay and also encrypts the messages.
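The ~1 minute replay window comes from the JWT's `iat` (issued-at) freshness check described above. A minimal sketch of that check (the function name is hypothetical; only the 60-second window is taken from the discussion):

```rust
/// A token's `iat` claim may differ from the server clock by at most
/// 60 seconds; within that window a captured token can be replayed.
const MAX_IAT_DRIFT_SECS: u64 = 60;

/// Hypothetical freshness check: accept the token only if its issued-at
/// time is within the allowed drift of the current time.
fn iat_is_fresh(iat_secs: u64, now_secs: u64) -> bool {
    now_secs.abs_diff(iat_secs) <= MAX_IAT_DRIFT_SECS
}

fn main() {
    let issued = 1_000_000u64;

    // A just-issued token is accepted...
    assert!(iat_is_fresh(issued, issued));
    // ...and so is a replay 59 seconds later: that's the attack window.
    assert!(iat_is_fresh(issued, issued + 59));
    // Two minutes later the replayed token is rejected.
    assert!(!iat_is_fresh(issued, issued + 120));
    println!("jwt freshness demo ok");
}
```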
Setting the channel to bounded, without doing much else, could cause similar trouble, with the message sitting in a waiting tokio task instead of the channel.
Instead, you'd likely want to think about a solution involving `try_send` or `send_timeout`.
In terms of correctness with the CL, the execution API spec already describes a notion of a timeout and retry. Under heavy load with an unbounded channel, Reth's eventual response could come long past the timeout and be ignored. The channel could also contain duplicate (retried) requests.
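The distinction, sketched with std's bounded channel (tokio's `Sender::try_send` behaves analogously): a plain `send` on a full bounded channel parks the sender, while `try_send` returns immediately and hands the rejected message back so the caller can decide what to do with it.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    let (tx, rx) = sync_channel::<u32>(1);
    tx.try_send(1).unwrap();

    // `send` here would block this thread until the receiver drains a slot;
    // in an async context the message would instead sit in a waiting task.
    // `try_send` fails fast and returns the message:
    match tx.try_send(2) {
        Err(TrySendError::Full(msg)) => {
            // Caller-controlled policy: log, drop, or retry later.
            assert_eq!(msg, 2);
        }
        other => panic!("unexpected: {other:?}"),
    }

    assert_eq!(rx.recv().unwrap(), 1);
    println!("try_send demo ok");
}
```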
thanks @gnattishness
`engine_newPayloadV3` and `engine_forkchoiceUpdatedV3` have 8-second timeouts.
how about keeping the unbounded channel in that case, but polling it at short intervals instead of only when the EL is ready to process an engine message. We do this and queue the engine messages in a type that evicts expired entries based on the timeout when inserting new entries, much like this type https://github.com/sigp/discv5/blob/8eac0fee8da65c4021c6bba7c976d395151309ed/src/lru_time_cache.rs#L7-L18.
I think that type can replace the VecDeque mentioned in the conflicting issue https://github.com/paradigmxyz/reth/issues/8203.
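A minimal sketch of such a timeout-evicting queue (modeled loosely on discv5's `LruTimeCache`; the `TimeoutQueue` name and its API are made up for illustration): entries older than the ttl are evicted whenever a new entry is inserted.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Queue that drops entries older than `ttl` whenever a new entry arrives,
/// so stale engine messages (past their spec'd timeout) are never processed.
struct TimeoutQueue<T> {
    ttl: Duration,
    entries: VecDeque<(Instant, T)>,
}

impl<T> TimeoutQueue<T> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: VecDeque::new() }
    }

    fn insert(&mut self, item: T) {
        let now = Instant::now();
        // Evict from the front while entries have outlived their ttl.
        while let Some((t, _)) = self.entries.front() {
            if now.duration_since(*t) > self.ttl {
                self.entries.pop_front();
            } else {
                break;
            }
        }
        self.entries.push_back((now, item));
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}

fn main() {
    // Tiny ttl to keep the demo fast; the real ttl would be the Engine API
    // timeout (e.g. 8 seconds for newPayloadV3 / forkchoiceUpdatedV3).
    let mut q = TimeoutQueue::new(Duration::from_millis(50));
    q.insert("old fcu");
    std::thread::sleep(Duration::from_millis(80));
    q.insert("fresh fcu"); // inserting evicts the expired entry
    assert_eq!(q.len(), 1);
    println!("timeout queue demo ok");
}
```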
closing in favour of addressing https://github.com/paradigmxyz/reth/issues/8203 wrt the spec'd timeouts for messages