Delayed RPC Send Using Tokens
## Issue Addressed
closes https://github.com/sigp/lighthouse/issues/5785
## Proposed Changes
The diagram below shows the differences in how the receiver (responder) behaves before and after this PR. The sections that follow detail the changes.
```mermaid
flowchart TD
    subgraph "*** After ***"
        Start2([START]) --> AA[Receive request]
        AA --> COND1{Is there already an active request <br> with the same protocol?}
        COND1 --> |Yes| CC[Send error response]
        CC --> End2([END])
        COND1 --> |No| COND2{Request is too large?}
        COND2 --> |Yes| CC
        COND2 --> |No| DD[Process request]
        DD --> EE{Rate limit reached?}
        EE --> |Yes| FF[Wait until tokens are regenerated]
        FF --> EE
        EE --> |No| GG[Send response]
        GG --> End2
    end
    subgraph "*** Before ***"
        Start([START]) --> A[Receive request]
        A --> B{Rate limit reached <br> or <br> request is too large?}
        B -->|Yes| C[Send error response]
        C --> End([END])
        B -->|No| E[Process request]
        E --> F[Send response]
        F --> End
    end
```
### Is there already an active request with the same protocol?
This check is not performed in "Before". It is taken from the consensus-specs PR below, which proposes updates regarding rate limiting and response timeouts.
https://github.com/ethereum/consensus-specs/pull/3767/files
> The requester MUST NOT make more than two concurrent requests with the same ID.
The spec PR mentions the requester side. In this PR, I introduced `ActiveRequestsLimiter` on the responder side to prevent more than two requests with the same protocol from running simultaneously per peer. If the limiter disallows a request, the responder sends a rate-limited error and penalizes the requester.
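For illustration, here is a minimal sketch of what such an active-requests limiter can look like: a per-(peer, protocol) counter with a cap of two. The type names and the `String`/`&str` stand-ins for the peer ID and protocol are assumptions for the sketch, not lighthouse's actual `ActiveRequestsLimiter`.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real peer ID and protocol types.
type PeerId = String;
type Protocol = &'static str;

/// Sketch: tracks how many requests are currently being served per
/// (peer, protocol) pair and rejects new ones once the cap is hit.
struct ActiveRequestsLimiter {
    active: HashMap<(PeerId, Protocol), usize>,
    max_concurrent: usize,
}

impl ActiveRequestsLimiter {
    fn new(max_concurrent: usize) -> Self {
        Self { active: HashMap::new(), max_concurrent }
    }

    /// Returns true and records the request if the peer is under the cap;
    /// otherwise the responder should send a rate-limited error.
    fn allows(&mut self, peer: PeerId, protocol: Protocol) -> bool {
        let count = self.active.entry((peer, protocol)).or_insert(0);
        if *count >= self.max_concurrent {
            return false;
        }
        *count += 1;
        true
    }

    /// Called once the response stream for a request has completed.
    fn request_completed(&mut self, peer: PeerId, protocol: Protocol) {
        let key = (peer, protocol);
        if let Some(count) = self.active.get_mut(&key) {
            *count = count.saturating_sub(1);
            if *count == 0 {
                self.active.remove(&key);
            }
        }
    }
}

fn main() {
    let mut limiter = ActiveRequestsLimiter::new(2);
    let peer = "peer-1".to_string();
    assert!(limiter.allows(peer.clone(), "blocks_by_range"));
    assert!(limiter.allows(peer.clone(), "blocks_by_range"));
    // A third concurrent request on the same protocol is rejected...
    assert!(!limiter.allows(peer.clone(), "blocks_by_range"));
    // ...but allowed again once an earlier one completes.
    limiter.request_completed(peer.clone(), "blocks_by_range");
    assert!(limiter.allows(peer, "blocks_by_range"));
}
```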
### Request is too large?
EDIT: I removed the `RequestSizeLimiter` and added `RPC::is_request_size_too_large()` instead, which checks whether the count of requested blocks/blobs is within the number defined in the specification. That is much simpler.
- discussion: https://github.com/sigp/lighthouse/pull/5923#discussion_r1689948084
- commit log: https://github.com/sigp/lighthouse/pull/5923/commits/5a9237fdc6a5ad11281ec7c36b2b7bdfdeba7ed4
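For illustration, a minimal sketch of what such a count check can look like. The request type is a simplified stand-in, and `MAX_REQUEST_BLOCKS = 1024` mirrors the phase0 spec constant; this is not lighthouse's actual implementation (which, as noted later in this thread, was eventually removed in favour of handler-side checks).

```rust
/// Maximum number of blocks in a single by-range request, per the
/// phase0 consensus spec constant `MAX_REQUEST_BLOCKS`.
const MAX_REQUEST_BLOCKS: u64 = 1024;

/// Simplified stand-in for the real BlocksByRangeRequest type.
struct BlocksByRangeRequest {
    start_slot: u64,
    count: u64,
}

/// Sketch of the request-size check: a request is too large when it
/// asks for more blocks than the spec allows in one request.
fn is_request_size_too_large(req: &BlocksByRangeRequest) -> bool {
    req.count > MAX_REQUEST_BLOCKS
}

fn main() {
    let too_big = BlocksByRangeRequest { start_slot: 100, count: 2048 };
    assert!(is_request_size_too_large(&too_big));
    let ok = BlocksByRangeRequest { start_slot: 100, count: 64 };
    assert!(!is_request_size_too_large(&ok));
}
```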
~~This request size check is also performed in "Before" using the `Limiter`, but in this PR, I introduced `RequestSizeLimiter` to handle this. Unlike the `Limiter`, `RequestSizeLimiter` is dedicated to performing this check.~~

~~The reasons why I introduced `RequestSizeLimiter` are:~~

- ~~In "After", the rate limiter is shared between the behaviour and the handler (`Arc<Mutex<RateLimiter>>`). (This is detailed in the next section.)~~
- ~~The request size check does not heavily depend on the rate-limiting logic, so it can be separated with minimal code duplication.~~
- ~~The request size check is performed on the behaviour side. By separating the request size check from the rate limiter, we can reduce the locking of the rate limiter.~~
### Rate limit reached? and Wait until tokens are regenerated
The rate limiter is shared between the behaviour and the handler (`Arc<Mutex<RateLimiter>>`). The handler checks the rate limit and queues the response if the limit has been reached; the behaviour handles pruning.
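To make the "wait until tokens are regenerated" step concrete, here is a minimal token-bucket sketch. The struct, the quota numbers, and the queueing loop are illustrative assumptions and do not reflect lighthouse's actual `RateLimiter` internals.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sketch of a token bucket: tokens regenerate at a fixed rate, and a
/// response that cannot afford its cost reports how long to wait.
struct TokenBucket {
    capacity: u64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: u64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity as f64, refill_per_sec, last_refill: Instant::now() }
    }

    fn refill(&mut self) {
        let elapsed = self.last_refill.elapsed().as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity as f64);
        self.last_refill = Instant::now();
    }

    /// Try to spend `cost` tokens. On failure, return how long to wait
    /// until the missing tokens will have regenerated.
    fn try_acquire(&mut self, cost: u64) -> Result<(), Duration> {
        self.refill();
        if self.tokens >= cost as f64 {
            self.tokens -= cost as f64;
            Ok(())
        } else {
            let missing = cost as f64 - self.tokens;
            Err(Duration::from_secs_f64(missing / self.refill_per_sec))
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(10, 1.0); // 10 tokens, 1 token/s.
    let mut queued: VecDeque<&str> = VecDeque::new();

    for response in ["resp-a", "resp-b"] {
        match bucket.try_acquire(8) {
            // Send immediately while tokens are available.
            Ok(()) => println!("send {response}"),
            // Otherwise queue it for later (the real code arms a timer
            // for `wait` rather than busy-waiting).
            Err(wait) => {
                println!("queue {response}, retry in {wait:?}");
                queued.push_back(response);
            }
        }
    }
}
```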
I considered not sharing the rate limiter between the behaviour and the handler, and instead performing all of this within either the behaviour or the handler. However, I decided against this for the following reasons:
- Regarding performing everything within the behaviour: the behaviour is unable to recognize the response protocol when `RPC::send_response()` is called, especially when the response is `RPCCodedResponse::Error`. Therefore, the behaviour can't rate-limit responses based on the response protocol.
- Regarding performing everything within the handler: when multiple connections are established with a peer, there could be multiple handlers interacting with that peer. Thus, we cannot enforce rate limiting per peer solely within the handler. (Any ideas? 🤔)
## Additional Info
### Naming
I have renamed the fields of the behaviour to make them more intuitive:
- `limiter` -> `response_limiter`
- `self_limiter` -> `outbound_request_limiter`
### Testing
I have run a beacon node with these changes for 24 hours, and it appears to work fine. The rate-limited error no longer occurred while running this branch.
This is a significant change to our RPC behaviour. As we are already putting quite a lot of significant upgrades into 5.3, I suggest we hold off until 5.3.1 for this PR.
@jimmygchen - Yep, we essentially remove the rate limit response.
This has been a point of contention from the beginning of consensus clients. Some clients will not spec a rate-limit response and feel it's the wrong strategy.
If we can't spec a rate-limit response, then sending one isn't very useful. Even at this moment, Lighthouse doesn't really acknowledge the rate limit response from another Lighthouse node; it just treats it as a general error. We don't have logic to parse the timeout and then re-request. Adding this kind of logic (especially in sync, as you know) will complicate things quite a bit.
After quite a few discussions with other client teams, it seems the lesser of all evils is to take this approach, where we naturally rate limit them in time. If we are to rate limit someone over the 10 second mark, we should just not send the message at all. At the moment, this is equivalent (from a peer scoring perspective) to sending a rate-limit message: both are down-scored. The downside of this approach is that the receiver has to wait 10 seconds to find out that the message isn't coming. I guess in this case, we should send a rate limit message so as not to make them wait 10 seconds.
I'm keen to get this in. @ackintosh - Let me know when comments are resolved and this is ready for a final review
@AgeManning @pawanjay176 @jxs This PR is now ready for review again. I've addressed the previous feedback and merged the latest unstable branch.
~~I received advice from Pawan, and I am now trying to add the protocol info to RPC errors so that we can handle all the rate-limiting logic within the behaviour.~~
https://github.com/sigp/lighthouse/pull/6444 has been merged, so we can remove the need to share a Mutex.
Based on #6444, I've fixed the Mutex issue. Now the entire response-limiting logic lives in the behaviour instead of being shared between the behaviour and the handler.
Also, I'm digging into an issue where a node is immediately banned, with the following logs:
```
3117:Oct 08 12:37:45.665 DEBG RPC Error direction: Incoming, score: inf, peer_id: 16Uiu2HAkw8cyrjC4UpnQKxdDGHsCQwK32PBKMmdc4xBdsnC8iDYg, client: Lighthouse: version: v5.3.0-156565c+, os_version: aarch64-macos, err: RPC response was an error: Rate limited with reason: Rate limited. There is an active request with the same protocol, protocol: data_column_sidecars_by_root, service: libp2p
3598:Oct 08 12:37:57.658 DEBG RPC Error direction: Incoming, score: inf, peer_id: 16Uiu2HAkw8cyrjC4UpnQKxdDGHsCQwK32PBKMmdc4xBdsnC8iDYg, client: Lighthouse: version: v5.3.0-156565c+, os_version: aarch64-macos, err: RPC response was an error: Rate limited with reason: Rate limited. There is an active request with the same protocol, protocol: data_column_sidecars_by_root, service: libp2p
3645:Oct 08 12:37:58.599 DEBG RPC Error direction: Incoming, score: inf, peer_id: 16Uiu2HAkw8cyrjC4UpnQKxdDGHsCQwK32PBKMmdc4xBdsnC8iDYg, client: Lighthouse: version: v5.3.0-156565c+, os_version: aarch64-macos, err: RPC response was an error: Rate limited with reason: Rate limited. There is an active request with the same protocol, protocol: data_column_sidecars_by_root, service: libp2p
```
@pawanjay176 @AgeManning @jxs
As Pawan mentioned here, we should add limiter logic similar to `ActiveRequestsLimiter` to the self-limiter as well, to limit the max concurrent streams per protocol on the client side. From my testing on a local testnet, multiple `data_column_sidecars_by_root` requests are being sent to the same peer at the same time. It seems this is more likely to occur when the network is smaller in scale.

I think, unlike the server-side limiter, the client-side limiter we'll add should just delay a request that exceeds the max concurrent streams per protocol, instead of rejecting it. So the limiter in `SelfRateLimiter` should be replaced with `ActiveRequestsLimiter`, so that requests are limited based on the number of concurrent requests rather than tokens.
What do you think? I just want to make sure this is the right direction.
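To illustrate the delay-instead-of-reject behaviour proposed here, a minimal sketch follows. All names are hypothetical and do not reflect the actual `SelfRateLimiter` API: requests over the per-protocol cap are queued and promoted when an in-flight request completes, rather than being dropped or erroring.

```rust
use std::collections::{HashMap, VecDeque};

type Protocol = &'static str;

/// Sketch: outbound requests over the per-protocol concurrency cap are
/// held in a queue instead of being rejected.
struct OutboundLimiter {
    in_flight: HashMap<Protocol, usize>,
    queued: HashMap<Protocol, VecDeque<String>>,
    max_concurrent: usize,
}

impl OutboundLimiter {
    fn new(max_concurrent: usize) -> Self {
        Self { in_flight: HashMap::new(), queued: HashMap::new(), max_concurrent }
    }

    /// Either send now or delay; nothing is ever dropped.
    fn send_or_queue(&mut self, protocol: Protocol, request: String) {
        let count = self.in_flight.entry(protocol).or_insert(0);
        if *count < self.max_concurrent {
            *count += 1;
            println!("sending {request} on {protocol}");
        } else {
            self.queued.entry(protocol).or_default().push_back(request);
        }
    }

    /// On completion, promote the oldest queued request, if any.
    fn on_complete(&mut self, protocol: Protocol) {
        if let Some(count) = self.in_flight.get_mut(protocol) {
            *count = count.saturating_sub(1);
        }
        if let Some(next) = self.queued.get_mut(protocol).and_then(|q| q.pop_front()) {
            self.send_or_queue(protocol, next);
        }
    }
}

fn main() {
    let mut limiter = OutboundLimiter::new(2);
    for i in 0..3 {
        limiter.send_or_queue("data_column_sidecars_by_root", format!("req-{i}"));
    }
    // req-2 was queued; completing one in-flight request releases it.
    limiter.on_complete("data_column_sidecars_by_root");
}
```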
@pawanjay176 @jxs
I'm not fully across this. But I recall the self-limiter is optional and it is used so that we don't try and send more messages than would break the rate limit. If we set the limiter to only be based on number of concurrent requests, we could still make a concurrent request that breaks the rate limit? Or is that not possible?
Just curious as to whether we need to limit tokens and concurrent requests or is just concurrent requests fine?
If and after the spec PR gets merged, the self-limiter (or some reworked form of it) should not be optional, per the spec update:
> The requester MUST NOT make more than `MAX_CONCURRENT_REQUESTS` concurrent requests with the same protocol ID.
> Just curious as to whether we need to limit tokens and concurrent requests or is just concurrent requests fine?
I tested this with a simple script. Even if we set the limiter to be based only on the number of concurrent requests, we can still make a concurrent request that breaks the rate limit (for example, two concurrent maximum-size requests stay under a concurrency cap of two but can together exceed the token quota). So, it seems we need to limit both tokens and concurrent requests, although it adds more complexity. Limiting tokens could be optional, but limiting concurrent requests should not be, according to the spec.
> Even if we set the limiter to be based only on the number of concurrent requests, we can still make a concurrent request that breaks the rate limit.
I did not understand this. If we get a request that breaks the concurrent limit, why can't we just wait to send that until the existing streams have concluded?
Right now, if I run this branch on a Kurtosis PeerDAS network, the lighthouse server rightly responds with an error because the lighthouse client is breaking the concurrency limit:
```
Oct 30 02:23:38.044 DEBG RPC Error direction: Outgoing, score: 0, peer_id: 16Uiu2HAmAcK84B5BJz3EiXzLK7SzQsEi7JWNsqL1YcdiAGZDj4aA, client: Lighthouse: version: v5.3.0-450326c, os_version: aarch64-linux, err: RPC response was an error: Rate limited with reason: Rate limited. There is an active request with the same protocol, protocol: data_column_sidecars_by_range, service: libp2p
```
This is the test I used (created in another branch forked from this PR). In the test, the sender maintains two (or fewer) active requests at a time, but self rate-limiting is triggered. The test is quite simple, but I think it shows that we need to limit both concurrent requests and (optionally) tokens. What do you think?
The spec PR has been merged.
Let me organize the remaining tasks of this PR :pray: :

1. Limit concurrent requests on the client (outbound requests) side, to comply with the spec update.
2. Whether we also need to limit tokens is under discussion.

Note: Currently, if we run this branch (on a Kurtosis local network), the lighthouse server responds with an error as the client side is exceeding the concurrency limit, resulting in a ban.

I think task 1. needs to be implemented in this PR because of the banning mentioned above. What do you think, @jxs? (Asking because of this comment.)
yeah makes sense Akihito, go for it :rocket:
https://github.com/sigp/lighthouse/actions/runs/12109786536/job/33759312718?pr=5923
The network-tests failure was not reproduced locally. It might be related to https://github.com/sigp/lighthouse/pull/6646.
@jxs @pawanjay176 @AgeManning
This is ready for another review. 🙏
I have added a concurrency limit to the self-limiter. Now, the self-limiter limits outbound requests based on both the number of concurrent requests and tokens (optional). Whether we also need to limit tokens in the self-limiter is still under discussion. Let me know if you have any ideas.
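As a rough illustration of how the two checks can compose (the concurrency cap always enforced, the token quota only when enabled), here is a minimal sketch with made-up types; it is not the actual self-limiter code.

```rust
/// Outcome of gating an outbound request in this sketch.
enum Gate {
    SendNow,
    Queue,
}

/// Sketch: the concurrency check is mandatory per the spec update; the
/// token check applies only when token limiting is configured.
struct SelfLimiterSketch {
    in_flight: usize,
    max_concurrent: usize,
    tokens: Option<u64>, // None = token limiting disabled.
}

impl SelfLimiterSketch {
    fn gate(&mut self, cost: u64) -> Gate {
        // 1. Mandatory: never exceed the concurrency cap.
        if self.in_flight >= self.max_concurrent {
            return Gate::Queue;
        }
        // 2. Optional: also respect the token quota when configured.
        if let Some(tokens) = self.tokens.as_mut() {
            if *tokens < cost {
                return Gate::Queue;
            }
            *tokens -= cost;
        }
        self.in_flight += 1;
        Gate::SendNow
    }
}

fn main() {
    let mut limiter = SelfLimiterSketch { in_flight: 0, max_concurrent: 2, tokens: Some(10) };
    for cost in [4, 4, 4] {
        match limiter.gate(cost) {
            Gate::SendNow => println!("send (cost {cost})"),
            // The third request is queued by the concurrency cap.
            Gate::Queue => println!("queue (cost {cost})"),
        }
    }
}
```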
(FYI)
I also ran lighthouse (this branch) on the testnet for ~24 hours. During this time, the LH node responded with 21 RateLimited errors due to the number of active requests. These errors appear in the logs like the example below. Note that this concerns inbound requests, not the self-limiter (outbound requests).
```
Dec 09 13:38:56.806 DEBG There is an active request with the same protocol, protocol: beacon_blocks_by_range, request: Blocks by range: Start Slot: 2738468, Count: 64, Step: 1, peer_id: 16Uiu2HAmERvtCC321A2Nu1pH6QmhXgvqALxgCnrp4qxr2qeVWr2P, service: libp2p_rpc, service: libp2p, module: lighthouse_network::rpc:491
```
@pawanjay176 @jxs @dapplion @jimmygchen - If anyone has any spare time, I think this is a good one to get in.
I removed `RPC::is_request_size_too_large` in https://github.com/sigp/lighthouse/pull/5923/commits/cfea9d2c04802af9c88c4aff681b29f573b0ac24 because such invalid requests are now handled in the Handler, as implemented in the following PRs:
- https://github.com/sigp/lighthouse/pull/6847
- https://github.com/sigp/lighthouse/pull/6986
@jxs @dapplion @pawanjay176 @jimmygchen @AgeManning
In this PR, the rate limiters have been updated in terms of naming and functionality according to the spec update. Here is a table summarizing the changes to the limiters. The current naming might not be ideal, so I hope this will be helpful for your feedback. 🙏
| Previous Naming | Current Naming | What This Does |
|---|---|---|
| `limiter: Option<RateLimiter>` | `response_limiter: Option<ResponseLimiter<E>>` | Previously, this rate-limited inbound requests. Currently, it rate-limits our responses to inbound requests. |
| `self_limiter: Option<SelfRateLimiter<Id, E>>` | `outbound_request_limiter: SelfRateLimiter<Id, E>` | Rate-limits our outbound requests. Previously, it was optional. |
Additionally, a global rate limiter for inbound requests has been proposed to reduce memory cost in worst-case scenarios. ( https://github.com/sigp/lighthouse/pull/5923#discussion_r1970446793 )
Ok, in the interest of not getting this thing stale and requiring us to maintain it, how do we all feel about a merge? @jxs @pawanjay176 @dapplion ?
I would like to have this in unstable soon, and work PeerDAS around this rate limit paradigm
I'm in favour of getting this in as well. Will test this for a bit and report back
@pawanjay176 Sorry for the delay. I've addressed your feedback.