p2p/discover: bootnodes not retried often enough when table becomes empty
System information
Geth version: a fork of 1.9.8-stable
OS & Version: OSX
Commit hash: d62e9b285777c036c108b89fac0c78f7855ba314
Expected behaviour
A node should repeatedly try using a bootnode for discovery if the bootnode is not accessible at node startup
Actual behaviour
When the bootnode is unreachable at the time the node starts, a single discoverTask targeting the bootnode is created and never another one. Looking at https://github.com/ethereum/go-ethereum/pull/20573, I'm not sure if this is already a known issue, but it is closely related.
Detailed breakdown:
When a discoverTask starts, lookupRunning is set to true in this if statement: https://github.com/ethereum/go-ethereum/blob/976a0f5558e20ed7cb7ba2cd68d7429d1ef01db9/p2p/dial.go#L184
So no other discoverTasks can be run at the same time. However, the discoverTask never finishes.
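To make the gating concrete, here is a minimal sketch (not the actual dial.go code; the scheduler type and the newTasks/taskDone names are stand-ins) of how one stuck lookup starves discovery: a new discoverTask is only created while lookupRunning is false, and the flag is only cleared when the task's Do() returns.

```go
package main

import "fmt"

// scheduler is a stand-in for the dial scheduler state; only the fields
// needed to illustrate the gating are included.
type scheduler struct {
	lookupRunning bool
	needDynDials  int
}

// newTasks mirrors the shape of the check referenced above: a discoverTask
// is only scheduled when no lookup is currently running.
func (s *scheduler) newTasks() []string {
	var tasks []string
	if !s.lookupRunning && s.needDynDials > 0 {
		s.lookupRunning = true
		tasks = append(tasks, "discoverTask")
	}
	return tasks
}

// taskDone clears the flag, but it is only ever called when Do() returns,
// which in this issue never happens.
func (s *scheduler) taskDone() { s.lookupRunning = false }

func main() {
	s := &scheduler{needDynDials: 3}
	fmt.Println(s.newTasks()) // [discoverTask]
	fmt.Println(s.newTasks()) // []: no new discoverTask until taskDone runs
}
```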
The task begins in Do():
https://github.com/ethereum/go-ethereum/blob/976a0f5558e20ed7cb7ba2cd68d7429d1ef01db9/p2p/dial.go#L325
which calls ReadNodes(): https://github.com/ethereum/go-ethereum/blob/976a0f5558e20ed7cb7ba2cd68d7429d1ef01db9/p2p/enode/iter.go#L36
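For reference, ReadNodes is essentially a loop that drains the iterator; a rough sketch of its shape (simplified, without the de-duplication the real function does) shows why a Next() that never yields keeps the discoverTask alive forever.

```go
package main

import "fmt"

// Node and Iterator are simplified stand-ins for the enode types.
type Node struct{ ID string }

type Iterator interface {
	Next() bool // blocks until a node is available or the iterator closes
	Node() *Node
}

// ReadNodes (simplified) keeps calling Next until it has n nodes or the
// iterator ends. If Next blocks forever, so does ReadNodes, and so does
// the discoverTask's Do().
func ReadNodes(it Iterator, n int) []*Node {
	nodes := make([]*Node, 0, n)
	for len(nodes) < n && it.Next() {
		nodes = append(nodes, it.Node())
	}
	return nodes
}

// sliceIter is a toy iterator used only to exercise the sketch.
type sliceIter struct {
	queue []*Node
	cur   *Node
}

func (s *sliceIter) Next() bool {
	if len(s.queue) == 0 {
		return false
	}
	s.cur, s.queue = s.queue[0], s.queue[1:]
	return true
}

func (s *sliceIter) Node() *Node { return s.cur }

func main() {
	it := &sliceIter{queue: []*Node{{ID: "a"}, {ID: "b"}, {ID: "c"}}}
	fmt.Println(len(ReadNodes(it, 2))) // 2
}
```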
ReadNodes uses the FairMix iterator, and calls Next(): https://github.com/ethereum/go-ethereum/blob/976a0f5558e20ed7cb7ba2cd68d7429d1ef01db9/p2p/enode/iter.go#L203
Next will block for a value from source.next here: https://github.com/ethereum/go-ethereum/blob/976a0f5558e20ed7cb7ba2cd68d7429d1ef01db9/p2p/enode/iter.go#L218
Which is given by a separate goroutine runSource: https://github.com/ethereum/go-ethereum/blob/976a0f5558e20ed7cb7ba2cd68d7429d1ef01db9/p2p/enode/iter.go#L276
runSource uses another iterator, which in this case is the lookupIterator. I found that the lookupIterator's Next() ends up in an infinite loop that causes runSource to hang, so the channel source.next never receives a value.
The infinite loop happens here https://github.com/ethereum/go-ethereum/blob/976a0f5558e20ed7cb7ba2cd68d7429d1ef01db9/p2p/discover/lookup.go#L209 as it switches between that if statement and the next one for each iteration.
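Here is a simplified sketch of that spin (the lookup and lookupIterator types are reduced to just the two branches involved, and the spin counter exists only so the example terminates): with an empty table, every new lookup fails to advance immediately, so the loop bounces between "start a lookup" and "discard the finished lookup" without ever buffering a node.

```go
package main

import "fmt"

// lookup stands in for a single lookup run; with an empty table it has
// nothing to query, so advance reports no progress right away.
type lookup struct{}

func (l *lookup) advance() bool { return false }

type lookupIterator struct {
	buffer  []string
	current *lookup
	spins   int // sketch-only guard so the example terminates
}

func (it *lookupIterator) Next() bool {
	for len(it.buffer) == 0 {
		if it.current == nil {
			it.current = &lookup{} // branch 1: start a new lookup
			continue
		}
		if !it.current.advance() { // branch 2: it made no progress, drop it
			it.current = nil
			it.spins++
			if it.spins > 3 { // the real loop has no such exit and spins forever
				return false
			}
			continue
		}
	}
	return true
}

func main() {
	it := &lookupIterator{}
	fmt.Println(it.Next()) // false after a few spins; the real code never returns
}
```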
This causes FairMix's Next() to time out, which is expected. The problem is that the timeout path calls m.nextFromAny(), which reads from m.fromAny here: https://github.com/ethereum/go-ethereum/blob/976a0f5558e20ed7cb7ba2cd68d7429d1ef01db9/p2p/enode/iter.go#L241, and that channel is never written to in runSource (https://github.com/ethereum/go-ethereum/blob/976a0f5558e20ed7cb7ba2cd68d7429d1ef01db9/p2p/enode/iter.go#L283) because of the infinite loop in its iterator's Next() call.
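A minimal sketch of that dead end (the channel layout here is assumed, not copied from iter.go): the per-source read times out, Next falls back to the shared fromAny channel, but that channel is only fed by runSource, which is itself stuck inside the hung iterator, so the fallback blocks as well.

```go
package main

import (
	"fmt"
	"time"
)

type node struct{ id string }

// fairMix is a stand-in with just the two channels involved in the hang.
type fairMix struct {
	fromAny chan *node // fed only by runSource, which is stuck
	timeout time.Duration
}

// nextFromSource mimics the timeout path: when the per-source channel is
// silent, fall back to the shared fromAny channel.
func (m *fairMix) nextFromSource(src chan *node) (*node, bool) {
	timer := time.NewTimer(m.timeout)
	defer timer.Stop()
	select {
	case n := <-src:
		return n, true
	case <-timer.C:
		return m.nextFromAny()
	}
}

func (m *fairMix) nextFromAny() (*node, bool) {
	select {
	case n := <-m.fromAny:
		return n, true
	case <-time.After(50 * time.Millisecond): // sketch-only guard; the real read just blocks
		return nil, false
	}
}

func main() {
	m := &fairMix{fromAny: make(chan *node), timeout: 20 * time.Millisecond}
	src := make(chan *node) // runSource never sends: its iterator is spinning
	_, ok := m.nextFromSource(src)
	fmt.Println("got node:", ok) // got node: false
}
```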
Steps to reproduce the behaviour
Start a node with --bootnodes pointing to a bootnode that is unreachable, then make the bootnode reachable; the node will not send any further discoverTasks to the bootnode.
This is definitely known, but it's also expected behavior. Discovery cannot work without bootstrapping and the lookup iterator will block until the table contains some nodes to do the lookup with.
We have a rewrite of the dialer coming up in #20592, but that PR mostly improves the latency of acting on discovery results. If there are no discovery results, no peer connections can be created.
Maybe I am misunderstanding your issue. AFAIK the lookupIterator will try to access the bootnodes repeatedly if they do not work. Please provide log output of geth --vmodule=p2p/*=5. The logs will show if that's the case.
Thanks for the response. I see dialTasks to the bootnode are attempted, but not discoverTasks apart from the initial one.
Here are the logs you asked for, from running
geth --vmodule=p2p/*=5 --bootnodes=enode://f1acc5e79e47b26a3257697ae9640662fd8e01f3edcf590676132bc12a653d04916f269a075063be79ab0079271882688e478c51ded4e9a63aadc72f454b8c37@35.247.9.127:0?discport=30301 while the bootnode was unreachable. I started the bootnode (to be clear, it's cmd/bootnode, not a full node) mid-run (I note where in the logs), and nothing on my node or the bootnode indicates that any discovery v4 messages were exchanged. When I re-run geth once the bootnode is already up, everything works as expected.
I think I see what you're saying about lookupIterator. Do you mean that https://github.com/ethereum/go-ethereum/blob/2c37142d2f177d0cccc114486040f0913a3900c7/p2p/discover/lookup.go#L194 should result in a call to https://github.com/ethereum/go-ethereum/blob/2c37142d2f177d0cccc114486040f0913a3900c7/p2p/discover/v4_udp.go#L418?
I haven't taken the time to look into why that wasn't happening, but it doesn't seem to be working the way I'd expect, because I saw an infinite loop between https://github.com/ethereum/go-ethereum/blob/2c37142d2f177d0cccc114486040f0913a3900c7/p2p/discover/lookup.go#L193 and https://github.com/ethereum/go-ethereum/blob/2c37142d2f177d0cccc114486040f0913a3900c7/p2p/discover/lookup.go#L197
I can try to look more into what the issue is there when I have some time
Thank you for the logs. It looks like the part that's broken is that newRandomLookup doesn't retry re-inserting the bootnodes when the discovery table is empty.
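One way to read that diagnosis, sketched as hypothetical code (the table type and reseedFromBootnodes helper below are illustrations, not existing go-ethereum functions): before seeding a random lookup, an empty table could have its configured bootnodes put back, so the lookup has something to query again.

```go
package main

import "fmt"

type node struct{ id string }

// table is a stand-in holding only what the sketch needs.
type table struct {
	entries   []*node
	bootnodes []*node
}

// reseedFromBootnodes is a hypothetical helper for illustration only: if the
// table has gone empty, retry the configured bootnodes.
func (t *table) reseedFromBootnodes() {
	if len(t.entries) == 0 {
		t.entries = append(t.entries, t.bootnodes...)
	}
}

// newRandomLookupSeeds shows where such a retry could slot in.
func (t *table) newRandomLookupSeeds() []*node {
	t.reseedFromBootnodes()
	return t.entries
}

func main() {
	t := &table{bootnodes: []*node{{id: "bootnode"}}}
	fmt.Println(len(t.newRandomLookupSeeds())) // 1: the bootnode is back in the table
}
```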
@fjl I was looking at this and thought it should have been fixed by #21396 (though that change doesn't re-insert the bootnodes), since it means the bootnode won't be dropped there if it's the only node in the table (which is what happens in this scenario). But I found that it didn't fix the issue, and after looking into it I saw (from the logs saying Removed dead node) that the bootnode is being removed from the table by doRevalidate(), which runs every 10 seconds: https://github.com/ethereum/go-ethereum/blob/fb2c79df1995b4e8dfe79f9c75464d29d23aaaf4/p2p/discover/table.go#L348-L354
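For context, a simplified sketch of the failure branch described above (the bucket type and revalidate function are reductions, not the real table.go code): when the revalidation ping fails and the bucket holds no replacements, the dead node, in this case the only bootnode, is simply dropped, leaving the table empty again.

```go
package main

import "fmt"

type node struct{ id string }

// bucket is a stand-in for a Kademlia table bucket.
type bucket struct {
	entries      []*node
	replacements []*node
}

// revalidate mimics the failure branch: the ping failed, so the last entry
// is either replaced or, with no replacements available, removed outright.
func revalidate(b *bucket, pingOK bool) {
	if pingOK || len(b.entries) == 0 {
		return
	}
	last := b.entries[len(b.entries)-1]
	if len(b.replacements) > 0 {
		b.entries[len(b.entries)-1] = b.replacements[0]
		b.replacements = b.replacements[1:]
		return
	}
	b.entries = b.entries[:len(b.entries)-1]
	fmt.Println("Removed dead node", last.id)
}

func main() {
	b := &bucket{entries: []*node{{id: "bootnode"}}} // only the bootnode, no replacements
	revalidate(b, false)                             // ping failed
	fmt.Println("entries left:", len(b.entries))     // 0: the table is empty again
}
```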
I figured it would make sense to change this to go along with the change in #21396, and I could submit a PR for it, but I wasn't sure exactly what the right approach is: avoid removing the node (a) if it's the only one in the bucket and there are no replacements, (b) if the bucket is less than half-full (as in #21396) and there are no replacements, or (c) something else. So I thought I'd ask for your thoughts before proceeding (see the sketch below for what option (b) might look like). Thanks!
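To make option (b) concrete, here is a small sketch of what the guard might look like (shouldKeepDeadNode is a hypothetical name, not an existing go-ethereum function): skip the removal when the bucket is less than half full and there is no replacement to promote, mirroring the condition #21396 added on the insertion side.

```go
package main

import "fmt"

const bucketSize = 16 // size of a discovery table bucket

type bucket struct {
	entries      []string
	replacements []string
}

// shouldKeepDeadNode is a hypothetical predicate for the revalidation path,
// expressing option (b): keep an unresponsive node if the bucket is less
// than half full and nothing is available to take its place.
func shouldKeepDeadNode(b *bucket) bool {
	return len(b.entries) < bucketSize/2 && len(b.replacements) == 0
}

func main() {
	b := &bucket{entries: []string{"bootnode"}} // only the bootnode, no replacements
	fmt.Println(shouldKeepDeadNode(b))          // true: leave it in the table
}
```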
I have been noticing a related bug which may share a root cause with this one, though it may not. Setup (done on a Windows 10 development machine):
- Start a node to act as a bootnode on a nonstandard local chain and copy the enode address.
- With that first node running, in a separate terminal (PowerShell) start another with the --bootnodes parameter as the quoted enode address from the prior step.
- The second node will connect, sync, etc., and act normally as expected. Let this run for a while.
- Hibernate the machine overnight.
- Bring the computer out of hibernation.
- The second node will no longer have any peers/connection to the original bootnode, and will not be able to discover it.
- This will continue indefinitely.
- Use CTRL+C to stop only the second node.
- Restart the second node using the exact same command as before.
- The second node will again connect to the first and do catch-up resync, then stay connected until the next hibernate.
Expected behavior: The enode address is retained. If the second node loses access to the bootnode and has no remaining peers, it'll try the bootnode again, just like it did at startup.
Is that the same issue as this, or should it be spun off as a separate one?
Any thoughts on pulling in oneeman's fix, or a response to the question above?
From my perspective (not a geth maintainer), option (b) seems to make the most sense. I'd want the bootnodes specified at startup to remain in the list and continue to be retried even if they are occasionally offline, but we don't want the table filled up completely with dead bootnodes while other live nodes are available on the network (unless --nodiscover is also used, in which case the potential harm from a table full of dead nodes seems like less of an issue).