bitcoin/bitcoin
p2p: Make stalling timeout adaptive during IBD
During IBD, the following stalling mechanism kicks in when we can't assign any new blocks from the 1024-block lookahead window because all of them are either already downloaded or in flight: we mark the peer from which we expect the block that would allow us to advance our tip (and thereby move the 1024 window ahead) as a possible staller. We then give this peer 2 more seconds to deliver a block (BLOCK_STALLING_TIMEOUT), and if it doesn't, we disconnect it and assign the critical block we need to another peer.
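For readers less familiar with that code path, here is a heavily simplified sketch of the mechanism described above. The constants correspond to the ones mentioned in this thread, but the struct and helper functions are made up for illustration and are not the actual net_processing code:

```cpp
#include <chrono>

static constexpr int BLOCK_DOWNLOAD_WINDOW{1024};                      // lookahead window
static constexpr auto BLOCK_STALLING_TIMEOUT{std::chrono::seconds{2}}; // grace period for a suspected staller

struct PeerState {
    std::chrono::seconds stalling_since{0}; // 0 means "not currently suspected of stalling"
    bool disconnect{false};
};

// Hypothetical helper: called when no block in the window can be assigned anymore
// because everything is downloaded or in flight; `staller` is the peer we are
// waiting on for the block that would advance the tip.
void MaybeMarkStaller(PeerState& staller, std::chrono::seconds now)
{
    if (staller.stalling_since == std::chrono::seconds{0}) {
        staller.stalling_since = now; // start the countdown
    }
}

// Hypothetical helper: called periodically for each peer.
void CheckStallingTimeout(PeerState& peer, std::chrono::seconds now)
{
    if (peer.stalling_since > std::chrono::seconds{0} &&
        now > peer.stalling_since + BLOCK_STALLING_TIMEOUT) {
        peer.disconnect = true; // evict the staller; the critical block is re-assigned to another peer
    }
}
```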
Now the problem is that this second peer (the one the critical block was re-assigned to) is immediately marked as a potential staller by the same mechanism and also given just 2 seconds - if our own connection is so slow that downloading this block simply takes us more than 2 seconds, that peer will also be disconnected (and so on...), leading to repeated disconnections and no progress in IBD. This has been described in #9213, and I have observed it when doing IBD on slower connections or over Tor - sometimes there would be several minutes without progress, during which all we did was disconnect peers and find new ones.
The 2s stalling timeout was introduced in #4468, when blocks weren't full and before SegWit increased the maximum possible physical size of blocks - so I think it made a lot of sense back then. But it would be good to revisit this timeout now.
This PR makes the timeout adaptive (idea by sipa): If we disconnect a peer for stalling, we now double the timeout for the next peer (up to a maximum of 64s). If we connect a block, we halve it again, down to the old value of 2 seconds. That way, peers that are comparatively slower will still get disconnected, but long phases of disconnecting all peers shouldn't happen anymore.
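A minimal sketch of this adaptive scheme, assuming a single global timeout value (the names below are illustrative, not necessarily the identifiers used in the PR):

```cpp
#include <algorithm>
#include <chrono>

using std::chrono::seconds;

static constexpr seconds BLOCK_STALLING_TIMEOUT_DEFAULT{2};
static constexpr seconds BLOCK_STALLING_TIMEOUT_MAX{64};

seconds g_block_stalling_timeout{BLOCK_STALLING_TIMEOUT_DEFAULT};

// After evicting a staller, give the next suspected staller twice as long.
void OnStallingDisconnect()
{
    g_block_stalling_timeout = std::min(2 * g_block_stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
}

// Download is making progress again, so decay back towards the 2 s default.
void OnBlockConnected()
{
    g_block_stalling_timeout = std::max(g_block_stalling_timeout / 2, BLOCK_STALLING_TIMEOUT_DEFAULT);
}
```

With this, a node whose own connection needs, say, 10 seconds per block will still churn through a few peers at first, but after at most three doublings the timeout is large enough for a normally responsive peer to deliver the critical block.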
Fixes #9213
Nice observation.
(Brainstorm idea) How about something like doubling the timeout every time it causes a disconnection. And then reducing/resetting it when the window actually moves?
Collect statistics from recent block download times during IBD and have a dynamic timeout based on this. (Introduces more complexity, but might be better in certain situations, e.g. when 6s aren't sufficient either).
Rather than this, it might be better to track download speeds from each peer, and check the speeds of this peer after 2 seconds.
For an immediate fix, though, maybe just making the timeout configurable would be a good idea?
Perhaps as an interim between these two ideas, if we disconnect N stalling peers, start increasing the timeout.
> Rather than this, it might be better to track download speeds from each peer, and check the speeds of this peer after 2 seconds.
We would also have to compare it to the speed of others and have some criterion for what deviation would be enough to disconnect.
> How about something like doubling the timeout every time it causes a disconnection. And then reducing/resetting it when the window actually moves?

> Perhaps as an interim between these two ideas, if we disconnect N stalling peers, start increasing the timeout.
Thanks! These suggestions are similar, make a lot of sense to me, and don't look very invasive to implement. I'm planning to change to this approach and update soon.
I now implemented the suggestion by @sipa to double the timeout and updated the OP.
I tested this manually by catching up to the best chain with an ~1 month old datadir with -onlynet=tor (slow, blocks take ~10s to download), while reducing BLOCK_DOWNLOAD_WINDOW and MAX_BLOCKS_IN_TRANSIT_PER_PEER to make stalling situations happen more quickly.
Nice! This seems a fine improvement.
I think one way of looking at stalling is that it happens when one peer's bandwidth is less than 1/64th of the total bandwidth (64 = 1024/16 = window/max in transit) [0].
I think that means there might be a clever way of preventing slow nodes from stalling the download by reducing the in transit limit instead -- so that instead of supplying 16 blocks in the time it takes the other peers to supply 1008 to avoid stalling, the peer only needs to supply 8 or 4 blocks in the time it takes the other peers to supply 1016 or 1020. [1]
I think adding the block-relay-only connections probably made this slightly worse, since there are now 2 extra peers, so now you need something like 25% more bandwidth in order to still have 1/64th of the total...
[0] Measured in blocks of course, so even if your bandwidth in bytes is fine, you might be unlucky enough to be asked for 16 blocks that are 2MB each, while everyone else is just being asked for 1008 50kB blocks (32MB for you vs ~7.2MB per other peer).
[1] Perhaps you could implement this by keeping a global and per-peer exponential rolling average of how many blocks you download per second; then you could set the peer's in-transit limit to 1024 * peer_avg / global_avg, capping it at 16, and marking the peer as stalling and disconnecting it if that value drops below 2 (in which case the remaining peers each have 50x this peer's bandwidth)?
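A back-of-the-envelope sketch of the formula in [1], assuming the rolling averages are maintained elsewhere; the function name, the std::optional return used to signal "treat as a staller", and the exact thresholds are all hypothetical:

```cpp
#include <algorithm>
#include <optional>

constexpr int BLOCK_DOWNLOAD_WINDOW{1024};
constexpr int MAX_BLOCKS_IN_TRANSIT_PER_PEER{16};

// peer_avg / global_avg are exponential rolling averages of blocks downloaded
// per second from this peer / from all peers combined.
// Returns std::nullopt if the peer should be considered a staller.
std::optional<int> InTransitLimit(double peer_avg, double global_avg)
{
    if (global_avg <= 0.0) return MAX_BLOCKS_IN_TRANSIT_PER_PEER; // no data yet
    const double limit = BLOCK_DOWNLOAD_WINDOW * peer_avg / global_avg;
    if (limit < 2.0) return std::nullopt; // the other peers have roughly 50x this peer's bandwidth
    return std::min(MAX_BLOCKS_IN_TRANSIT_PER_PEER, static_cast<int>(limit));
}
```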
> I think that means there might be a clever way of preventing slow nodes from stalling the download by reducing the in transit limit instead -- so that instead of supplying 16 blocks in the time it takes the other peers to supply 1008 to avoid stalling, the peer only needs to supply 8 or 4 blocks in the time it takes the other peers to supply 1016 or 1020. [1]
That sounds like a very interesting alternative approach. I'm not sure I understand it completely though: Are you suggesting to assign slower peers fewer blocks simultaneously, to help prevent stalling situations from occurring in the first place? And also move away from the concept that a stalling situation occurs only when we can't move the 1024 block window forward, but make it dependent on the other peers instead, so that we'd possibly disconnect slow peers much earlier than that if they are slower in comparison to faster ones?
Concept ACK. I will give a code review ACK once you resolve vasil's comments :)
> That sounds like a very interesting alternative approach.
Yeah, it shouldn't hold up this fix though, I think.
> I'm not sure I understand it completely though: Are you suggesting to assign slower peers fewer blocks simultaneously, to help prevent stalling situations from occurring in the first place?
Yes.
> And also move away from the concept that a stalling situation occurs only when we can't move the 1024 block window forward, but make it dependent on the other peers instead, so that we'd possibly disconnect slow peers much earlier than that if they are slower in comparison to faster ones?
Not really. I think you need to keep the 1024 block window (since increasing that hurts pruning), and I think that if the window gets full you should still call that "stalling".
But I think if you change the MAX_BLOCKS_IN_TRANSIT_PER_PEER so that slower peers have fewer in-transit blocks, then you'll be stalling much less often, and may not need to disconnect them at all -- that lets you stay connected to tor peers during IBD (for partition/sybil resistance) even if your ipv4/ipv6 peers are much faster. Maybe you could disconnect them when their max_blocks_in_transit drops to 1 or 2, before they actually cause stalling?
Maybe a more specific example would be worthwhile. As it stands, if your first peer will give you one block every 5 seconds, and your other 9 peers will collectively give you 14 blocks every second (on average, 7.8 times faster than the first peer, in total 70 times faster), then by the time that first peer has downloaded blocks 1..15 (which takes 75 seconds), the other peers will have given you blocks 17..1039 after 73.1 seconds, and stalling gets triggered. But if the slow peer had only queued up 8 blocks, then it would have supplied them in 40 seconds, which only gives the other peers enough time to supply 560 blocks, so they won't fill up the window. Hmm, I guess it ought to be possible to simulate that scenario via the functional test's P2PInterface stuff...
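The arithmetic in that example can be double-checked with a few lines; the rates are taken from the scenario above, and the stall condition is simplified to "the other peers fill every window slot except the slow peer's in-flight blocks before the slow peer finishes":

```cpp
#include <cstdio>

int main()
{
    const double slow_rate = 0.2;  // slow peer: one block every 5 seconds
    const double fast_rate = 14.0; // combined rate of the other 9 peers
    const int window = 1024;       // BLOCK_DOWNLOAD_WINDOW

    for (int in_transit : {16, 8}) {
        const double t_slow = in_transit / slow_rate;      // time for the slow peer to deliver its queue
        const double others = fast_rate * t_slow;          // blocks the others can deliver in that time
        const bool stalls = others >= window - in_transit; // window full except the slow peer's slots
        std::printf("in_transit=%2d: slow peer needs %5.1f s, others deliver %6.0f blocks -> %s\n",
                    in_transit, t_slow, others, stalls ? "stall" : "no stall");
    }
    return 0;
}
```

With a limit of 16 the other peers fill the window well before the slow peer is done (stall); with a limit of 8 they only manage 560 of the needed 1016 blocks, so the window keeps moving.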
Concept ACK
I think it would be nice if this PR also added some tests, because it looks like we didn't have any tests for the stalling mechanism in the first place.
Will address feedback soon (and work on adding a test for the stalling logic).
- Addressed review comments: In particular, don't halve the adaptive timeout each time a block is connected, but let it go back to the default value more slowly
- Added a functional test for the stalling logic.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Reviews
Type | Reviewers
---|---
ACK | vasild, naumenkogs, achow101
Concept ACK | dergoegge, sipa, RandyMcMillan
Stale ACK | w0xlt, luke-jr
utACK 4b0dbc0f3eb8c57944f9037e017b89c912048206
I am ok with dropping the test. It is good to have tests to ensure the code works as intended, but we can't have tests for everything, and there is a subjective threshold somewhere. If a test is too difficult to implement properly or is more complicated than the actual code it tests, then it may be too expensive. There is a maintenance cost for the test too: developers could waste precious time investigating a sporadically failing test, fixing it, or trying to figure out whether their (seemingly unrelated) changes broke it. I am not saying to drop the test, just that I would be ok with that.
Addressed the test feedback (will get to the outstanding comment for the main commit a bit later).
- First test (should not stall): those received: block... messages in the log above (from the failure of the second test) are produced by the first test. The sleep(0.5) was apparently not enough, so I fixed it to sleep(5) (just for testing, not to actually have it in the final test). This means that in practice it could stall and remain undetected by the first test, because it will be happy to not see "Stall started" in the log even though it may be printed shortly after the first test has eagerly declared success. We want to check that there are 1023 received: block messages in the log, that afterwards the stalling logic from SendMessages() is executed, and that after that there is no "Stall started" in the log. I am not sure how to do that. Checking the bytes received for block messages seems to be better than the sleep, but could still end the wait too early.
I rewrote the test so that it doesn't use the log anymore, but waits until all blocks are received, syncs (so that a peer could get marked as a staller), waits for 3s, syncs again (so that a peer could get disconnected), and then checks that no peer gets disconnected.
- Second test (should stall): it fails because there is no "Stall started peer=0" message. I added sleep(10) at the end of the with... block to wait even more for the stall. Then it fails with this error:
I removed the peer=0 part of the check and added a missing self.all_sync_send_with_ping(peers) to the with block. With that, the test succeeds for me even with some slow sanitizers enabled - will do more runs over the weekend to check for intermittent failures.
> I am ok with dropping the test. It is good to have tests to ensure the code works as intended, but we can't have tests for everything, and there is a subjective threshold somewhere. If a test is too difficult to implement properly or is more complicated than the actual code it tests, then it may be too expensive. There is a maintenance cost for the test too: developers could waste precious time investigating a sporadically failing test, fixing it, or trying to figure out whether their (seemingly unrelated) changes broke it. I am not saying to drop the test, just that I would be ok with that.
If everyone agrees, that would be ok with me. However, the stalling logic was completely untested before, which is not ideal, so the test doesn't just cover the changes from this PR. @dergoegge, do you have an opinion, since you suggested the test? Do you think the stalling logic could be better covered by a unit test after #25515 (which would have fewer problems with timeouts)?
9339230 to aceff9e: Also addressed the outstanding comments to the main commit (plus minor reformatting of comments) and fixed another source of spurious test failures - thanks for the reviews!
aceff9e to 39b9364: addressed feedback by @vasild - the CI failure is unrelated (I opened #26404 to fix it).
ACK 39b93649c4b98cd82c64b957fd9f6a6fd3c2a359
Strong Concept ACK 39b93649c4b98cd82c64b957fd9f6a6fd3c2a359
Will do some tests ASAP.