Node fails to dial any Tendermint peers
Sometimes during E2E tests on CI, some nodes fail to dial any peers. They connect to the seed node, that connection then fails with an EOF error (which may just be part of a seed node's normal operation?), and they never dial any other peers.
So either the node doesn't receive any peers from the seed node (while other nodes do), or it somehow fails to add them correctly to the address book, or the address book is broken in a way that makes it keep dialing only the seed node.
On the seed node side, the following happens (in this case, c1120a71e92eb601c69ce119aef109b05db568d0 is the peer that fails to get any peers from the seed node):
{"caller":"service.go:138","impl":"Peer{MConn{127.0.0.1:40622} c1120a71e92eb601c69ce119aef109b05db568d0 in}","level":"info","module":"tendermint:switch","msg":"Starting Peer","peer":"[email protected]:40622","ts":"2020-01-13T10:06:28.412555307Z"}
{"caller":"service.go:138","impl":"MConn{127.0.0.1:40622}","level":"info","module":"tendermint:switch","msg":"Starting MConnection","peer":"[email protected]:40622","ts":"2020-01-13T10:06:28.412574807Z"}
{"caller":"switch.go:786","level":"info","module":"tendermint:switch","msg":"Added peer","peer":"Peer{MConn{127.0.0.1:40622} c1120a71e92eb601c69ce119aef109b05db568d0 in}","ts":"2020-01-13T10:06:28.412681207Z"}
...
{"caller":"connection.go:592","conn":"MConn{127.0.0.1:40622}","err":"read tcp 127.0.0.1:20024-\u003e127.0.0.1:40622: use of closed network connection","level":"error","module":"tendermint:switch","msg":"Connection failed @ recvRoutine (reading byte)","peer":"[email protected]:40622","ts":"2020-01-13T10:06:28.512504268Z"}
{"caller":"service.go:166","impl":"MConn{127.0.0.1:40622}","level":"info","module":"tendermint:switch","msg":"Stopping MConnection","peer":"[email protected]:40622","ts":"2020-01-13T10:06:28.512522068Z"}
{"caller":"switch.go:310","err":"read tcp 127.0.0.1:20024-\u003e127.0.0.1:40622: use of closed network connection","level":"error","module":"tendermint:switch","msg":"Stopping peer for error","peer":"Peer{MConn{127.0.0.1:40622} c1120a71e92eb601c69ce119aef109b05db568d0 in}","ts":"2020-01-13T10:06:28.512532568Z"}
{"caller":"service.go:166","impl":"Peer{MConn{127.0.0.1:40622} c1120a71e92eb601c69ce119aef109b05db568d0 in}","level":"info","module":"tendermint:switch","msg":"Stopping Peer","peer":"[email protected]:40622","ts":"2020-01-13T10:06:28.512553468Z"}
{"caller":"connection.go:323","err":"write tcp 127.0.0.1:20024-\u003e127.0.0.1:40622: use of closed network connection","level":"error","module":"tendermint:switch","msg":"MConnection flush failed","peer":"[email protected]:40622","ts":"2020-01-13T10:06:28.512582668Z"}
And then the whole sequence repeats. Not obvious why this happens.
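For what it's worth, "use of closed network connection" is the error Go's net package returns when the connection was closed *locally*, as opposed to the EOF a reader sees when the remote side closes. A minimal plain-Go reproduction (illustrative only, not Tendermint code):

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// Dial ourselves so Accept returns a live connection.
	go net.Dial("tcp", ln.Addr().String())

	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	conn.Close() // close locally, like the switch stopping a peer

	_, err = conn.Read(make([]byte, 1))
	fmt.Println(err)                           // "... use of closed network connection"
	fmt.Println(errors.Is(err, net.ErrClosed)) // true
}
```

So the log above suggests the seed node itself (or something on its side) closed the connection while the recvRoutine was still reading.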
The next thing to try is probably to capture a dump of the network traffic while this is happening (which is awkward, since it only happens on CI, where the run eventually just times out).
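If it comes to that, a capture could also be taken from inside the harness itself. A rough sketch using the gopacket library (the interface name "lo" and port 20024 are taken from the logs above; this assumes libpcap is available on the CI host, so adjust for the actual environment):

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
)

func main() {
	// Capture loopback traffic on the seed node's P2P port so the
	// PEX exchange can be inspected after the test run.
	handle, err := pcap.OpenLive("lo", 65535, false, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	// Only keep packets to/from the seed node's port.
	if err := handle.SetBPFFilter("tcp port 20024"); err != nil {
		log.Fatal(err)
	}

	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for pkt := range src.Packets() {
		fmt.Println(pkt.Metadata().Timestamp, pkt)
	}
}
```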
What are the logs on the peer side in this case? I assume it could be the "cannot connect to any peers, falling back to seeds" case? That would explain why the seed is redialed.
Is this a response to my "Not obvious why this happens."? Sorry, I should have said that more clearly: the non-obvious part is why the connection fails in such a way.
So yes, the repetition is caused by the peer redialing the seed node, but the behavior is exactly the same on each redial.
I'm saying that this seed<->node dialing could be fine (e.g., the seed returning the correct peers), but if then for some reason the dialing node couldn't connect to any of the peers returned by the seed node, the redialing would be expected.
> the non-obvious part is why the connection fails in such a way
Don't the seed node connection logs always look like this?
> if then for some reason the dialing node couldn't connect to any of the peers returned by the seed node
In that case it would still update the address book and log its attempts to connect to those peers. That doesn't happen: the address book remains empty and the peer's only dialing attempts are to the seed node.
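One cheap way to confirm the "address book remains empty" part from the test artifacts alone is to inspect the persisted address book. A rough sketch, assuming the default layout where Tendermint serializes the book to config/addrbook.json with a top-level "addrs" array (worth double-checking against the version under test):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {
	// Path is illustrative; point it at the failing node's home directory.
	raw, err := os.ReadFile("config/addrbook.json")
	if err != nil {
		log.Fatal(err)
	}
	// Decode only the field we care about; everything else is ignored.
	var book struct {
		Addrs []json.RawMessage `json:"addrs"`
	}
	if err := json.Unmarshal(raw, &book); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("address book holds %d known addresses\n", len(book.Addrs))
}
```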
> Don't the seed node connection logs always look like this?
No, for most peers it says "shutting down connection gracefully" (though I think the peer itself still reports EOF). I'm not sure whether this has changed since the original issue was filed.
So this seems to indicate that the connection with the seed node fails before the peer addresses can be sent.
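As a companion to the sketch further up: when the remote side closes the connection first, the local reader gets io.EOF rather than net.ErrClosed, which would match the peer reporting EOF while the seed logs a graceful shutdown. A minimal illustration (again plain Go, not Tendermint code):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	closed := make(chan struct{})
	go func() {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			panic(err)
		}
		c.Close() // the remote side shuts down gracefully
		close(closed)
	}()

	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	<-closed

	_, err = conn.Read(make([]byte, 1))
	fmt.Println(err) // io.EOF: the peer closed the connection, we did not
}
```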
I'm getting up to speed on this project and this bug caught my eye. Could it be that the redialing is not being throttled? Would adding a delay between redials work around this issue?
Hm, dialing peers should already be throttled, but the test harness (note that this issue is specifically about E2E tests) spins up many nodes at once, so maybe that could run into issues. I'm not sure whether we are still seeing this regularly during E2E tests, though.
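For reference, the kind of throttling being discussed would look roughly like the sketch below: exponential backoff with jitter between dial attempts, so that many nodes started at once don't retry in lockstep. All names here are illustrative; this is not Tendermint's actual dialing code.

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
	"time"
)

// dialWithBackoff retries a TCP dial with exponential backoff plus jitter.
func dialWithBackoff(addr string, maxAttempts int) (net.Conn, error) {
	backoff := 500 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err == nil {
			return conn, nil
		}
		fmt.Printf("attempt %d failed: %v; retrying in ~%v\n", attempt, err, backoff)
		// Jitter spreads out retries from nodes that started simultaneously.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff/2))))
		backoff *= 2
	}
	return nil, fmt.Errorf("could not dial %s after %d attempts", addr, maxAttempts)
}

func main() {
	// Port 20024 is the seed node's P2P port from the logs above.
	if conn, err := dialWithBackoff("127.0.0.1:20024", 3); err == nil {
		conn.Close()
	} else {
		fmt.Println(err)
	}
}
```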