nimbus-eth2
Flooded with "Local timer is broken or peer's status information is invalid topics" log
Describe the bug
I've hit the scenario several times where one of my nodes gets stuck in the following loop, on pyrmont.
WRN 2020-11-19 15:56:02.879+00:00 Local timer is broken or peer's status information is invalid topics="beacnde" tid=12959 file=sync_manager.nim:748 wall_clock_slot=8379 remote_head_slot=65947 local_head_slot=560 peer=16*3QXDwx index=9 tolerance_value=0 peer_speed=624.0 peer_score=50
WRN 2020-11-19 15:56:02.880+00:00 Local timer is broken or peer's status information is invalid topics="beacnde" tid=12959 file=sync_manager.nim:748 wall_clock_slot=8379 remote_head_slot=65947 local_head_slot=560 peer=16*3QXDwx index=9 tolerance_value=0 peer_speed=624.0 peer_score=50
WRN 2020-11-19 15:56:02.880+00:00 Local timer is broken or peer's status information is invalid topics="beacnde" tid=12959 file=sync_manager.nim:748 wall_clock_slot=8379 remote_head_slot=65947 local_head_slot=560 peer=16*3QXDwx index=9 tolerance_value=0 peer_speed=624.0 peer_score=50
WRN 2020-11-19 15:56:02.880+00:00 Local timer is broken or peer's status information is invalid topics="beacnde" tid=12959 file=sync_manager.nim:748 wall_clock_slot=8379 remote_head_slot=65947 local_head_slot=560 peer=16*3QXDwx index=9 tolerance_value=0 peer_speed=624.0 peer_score=50
WRN 2020-11-19 15:56:02.880+00:00 Local timer is broken or peer's status information is invalid topics="beacnde" tid=12959 file=sync_manager.nim:748 wall_clock_slot=8379 remote_head_slot=65947 local_head_slot=560 peer=16*3QXDwx index=9 tolerance_value=0 peer_speed=624.0 peer_score=50
These logs repeat over and over; it doesn't look like the node is doing anything else anymore.
The strange thing is that I've seen this only on a Raspberry Pi so far. The time is set correctly on the device, and as you can see, the remote_head_slot is way too high.
rev 771c1d092aaf5eac062414feafc59c7a1ef121b2
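For context on what the warning means, here is a minimal, self-contained Nim sketch of the kind of check that seems to produce it, using the field names from the log above; the actual condition in sync_manager.nim may differ, so treat the logic as an assumption.

# Sketch of the apparent trigger condition, based only on the log fields above;
# the real check in sync_manager.nim may be different.
proc statusLooksInvalid(wallClockSlot, remoteHeadSlot, toleranceValue: uint64): bool =
  # A peer claiming a head slot further ahead than our wall clock allows (plus
  # tolerance) means either our local timer is off or the peer's status is bogus.
  remoteHeadSlot > wallClockSlot + toleranceValue

when isMainModule:
  # Values from the log lines: wall_clock_slot=8379, remote_head_slot=65947, tolerance_value=0
  echo statusLooksInvalid(8379, 65947, 0)  # true -> the warning is emitted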
Update: I got it with debug logs now; here is a partial grep on the short peer id:
{"lvl":"DBG","ts":"2020-11-19 20:12:44.778+00:00","msg":"created new pubsub peer","topics":"pubsub","tid":15981,"file":"pubsub.nim:152","peer":"16*KjgfKo"}
{"lvl":"DBG","ts":"2020-11-19 20:12:44.779+00:00","msg":"Peer upgraded","topics":"networking","tid":15981,"file":"eth2_network.nim:883","peer":"16Uiu2HAkwcnvyzeqv5BXVMiPUUhrTfvi3UYzt3y6muTAoiKjgfKo","connections":1}
{"lvl":"DBG","ts":"2020-11-19 20:12:44.779+00:00","msg":"Peer connected","topics":"sync","tid":15981,"file":"sync_protocol.nim:94","peer":"16*KjgfKo","peerInfo":{"peerId":"16Uiu2HAkwcnvyzeqv5BXVMiPUUhrTfvi3UYzt3y6muTAoiKjgfKo","addrs":["/ip4/3.236.236.3/tcp/9000/p2p/16Uiu2HAkwcnvyzeqv5BXVMiPUUhrTfvi3UYzt3y6muTAoiKjgfKo"],"protocols":["/ipfs/ping/1.0.0","/meshsub/1.1.0","/meshsub/1.0.0"],"protoVersion":"ipfs/0.1.0","agentVersion":"teku/teku/v0.12.14-dev-943e901c/linux-x86_64/oracle_openjdk-java-14"},"incoming":true}
{"lvl":"DBG","ts":"2020-11-19 20:12:45.657+00:00","msg":"starting pubsub read loop","topics":"pubsubpeer","tid":15981,"file":"pubsubpeer.nim:96","conn":"16*KjgfKo:5fb6d1bc45043d546df1c7bd","peer":"16*KjgfKo","closed":false}
{"lvl":"DBG","ts":"2020-11-19 20:12:46.402+00:00","msg":"decodeMsg: decoded message","topics":"identify","tid":15981,"file":"identify.nim:91","pubkey":"secp25...53])))","addresses":"@[/ip4/3.236.236.3/tcp/9000/p2p/16Uiu2HAkwcnvyzeqv5BXVMiPUUhrTfvi3UYzt3y6muTAoiKjgfKo]","protocols":"@[\"/ipfs/ping/1.0.0\", \"/meshsub/1.1.0\", \"/meshsub/1.0.0\"]","observable_address":"Some(/ip4/82.248.90.211/tcp/9001)","proto_version":"Some(\"ipfs/0.1.0\")","agent_version":"Some(\"teku/teku/v0.12.14-dev-943e901c/linux-x86_64/oracle_openjdk-java-14\")"}
{"lvl":"DBG","ts":"2020-11-19 20:12:46.416+00:00","msg":"starting pubsub read loop","topics":"pubsubpeer","tid":15981,"file":"pubsubpeer.nim:96","conn":"16*KjgfKo:5fb6d1bc45043d546df1c7ba","peer":"16*KjgfKo","closed":false}
{"lvl":"DBG","ts":"2020-11-19 20:12:46.558+00:00","msg":"Irrelevant peer","topics":"sync","tid":15981,"file":"sync_protocol.nim:238","peer":"16*KjgfKo","theirStatus":{"forkDigest":"979baf67","finalizedRoot":"cd4d9434","finalizedEpoch":2099,"headRoot":"cbfacefd","headSlot":67263},"ourStatus":{"forkDigest":"3b088795","finalizedRoot":"548f04e0","finalizedEpoch":135,"headRoot":"480395fe","headSlot":4395}}
{"lvl":"DBG","ts":"2020-11-19 20:12:46.925+00:00","msg":"Irrelevant peer","topics":"sync","tid":15981,"file":"sync_protocol.nim:238","peer":"16*KjgfKo","theirStatus":{"forkDigest":"979baf67","finalizedRoot":"cd4d9434","finalizedEpoch":2099,"headRoot":"cbfacefd","headSlot":67263},"ourStatus":{"forkDigest":"3b088795","finalizedRoot":"548f04e0","finalizedEpoch":135,"headRoot":"480395fe","headSlot":4395}}
{"lvl":"DBG","ts":"2020-11-19 20:12:47.682+00:00","msg":"Peer's syncing status","topics":"beacnde","tid":15981,"file":"sync_manager.nim:741","wall_clock_slot":9663,"remote_head_slot":67254,"local_head_slot":4395,"peer_score":50,"peer":"16*KjgfKo","index":13,"peer_speed":523.0}
{"lvl":"WRN","ts":"2020-11-19 20:12:47.682+00:00","msg":"Local timer is broken or peer's status information is invalid","topics":"beacnde","tid":15981,"file":"sync_manager.nim:748","wall_clock_slot":9663,"remote_head_slot":67254,"local_head_slot":4395,"peer":"16*KjgfKo","index":13,"tolerance_value":0,"peer_speed":523.0,"peer_score":50}
{"lvl":"DBG","ts":"2020-11-19 20:12:47.682+00:00","msg":"Peer's syncing status","topics":"beacnde","tid":15981,"file":"sync_manager.nim:741","wall_clock_slot":9663,"remote_head_slot":67254,"local_head_slot":4395,"peer_score":50,"peer":"16*KjgfKo","index":13,"peer_speed":523.0}
{"lvl":"WRN","ts":"2020-11-19 20:12:47.683+00:00","msg":"Local timer is broken or peer's status information is invalid","topics":"beacnde","tid":15981,"file":"sync_manager.nim:748","wall_clock_slot":9663,"remote_head_slot":67254,"local_head_slot":4395,"peer":"16*KjgfKo","index":13,"tolerance_value":0,"peer_speed":523.0,"peer_score":50}
The strange thing here is that this is a peer from the wrong network (see the Irrelevant peer log), yet we still end up using it for syncing. That shouldn't happen, as a disconnect call is made right after the Irrelevant peer log.
Adding doAssert(not peer.network.switch.isConnected(peer.info.peerId), "Peer should have been dropped!") on the line after https://github.com/status-im/nimbus-eth2/blob/master/beacon_chain/sync_protocol.nim#L239 makes the assertion trigger:
Error: unhandled exception: /home/nimbus/nimbus-eth2/beacon_chain/sync_protocol.nim(241, 13) `not peer.network.switch.isConnected(peer.info.peerId)` Peer should have been dropped! [AssertionError]
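As a self-contained illustration of why the assertion fires, here is a small model using stand-in types instead of the real peer/switch API; everything except the assertion message is invented.

# Stand-in model of the observed behaviour: the disconnect call made after the
# "Irrelevant peer" log apparently does not actually drop the connection, so a
# check placed right after it fails.
type Peer = object
  id: string
  connected: bool

proc disconnect(peer: var Peer) =
  # In the buggy scenario the peer stays connected after this call, so this
  # stand-in deliberately leaves peer.connected untouched.
  discard

var peer = Peer(id: "16*KjgfKo", connected: true)
peer.disconnect()
doAssert(not peer.connected, "Peer should have been dropped!")  # fires, as in the output above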
Adding it as a log line instead (warn "Is the peer still connected?", connected=peer.network.switch.isConnected(peer.info.peerId), peer) to see the further course of events:
{"lvl":"DBG","ts":"2020-11-20 10:34:05.028+00:00","msg":"created new pubsub peer","topics":"pubsub","tid":28076,"file":"pubsub.nim:152","peer":"16*KjgfKo"}
{"lvl":"DBG","ts":"2020-11-20 10:34:05.029+00:00","msg":"Peer upgraded","topics":"networking","tid":28076,"file":"eth2_network.nim:883","peer":"16Uiu2HAkwcnvyzeqv5BXVMiPUUhrTfvi3UYzt3y6muTAoiKjgfKo","connections":1}
{"lvl":"DBG","ts":"2020-11-20 10:34:05.029+00:00","msg":"Peer connected","topics":"sync","tid":28076,"file":"sync_protocol.nim:94","peer":"16*KjgfKo","peerInfo":{"peerId":"16Uiu2HAkwcnvyzeqv5BXVMiPUUhrTfvi3UYzt3y6muTAoiKjgfKo","addrs":["/ip4/3.236.236.3/tcp/9000/p2p/16Uiu2HAkwcnvyzeqv5BXVMiPUUhrTfvi3UYzt3y6muTAoiKjgfKo"],"protocols":["/ipfs/ping/1.0.0","/meshsub/1.1.0","/meshsub/1.0.0"],"protoVersion":"ipfs/0.1.0","agentVersion":"teku/teku/v0.12.14-dev-943e901c/linux-x86_64/oracle_openjdk-java-14"},"incoming":true}
{"lvl":"DBG","ts":"2020-11-20 10:34:05.975+00:00","msg":"starting pubsub read loop","topics":"pubsubpeer","tid":28076,"file":"pubsubpeer.nim:96","conn":"16*KjgfKo:5fb79b9d6ff7e1542c122342","peer":"16*KjgfKo","closed":false}
{"lvl":"DBG","ts":"2020-11-20 10:34:06.121+00:00","msg":"decodeMsg: decoded message","topics":"identify","tid":28076,"file":"identify.nim:91","pubkey":"secp25...53])))","addresses":"@[/ip4/3.236.236.3/tcp/9000/p2p/16Uiu2HAkwcnvyzeqv5BXVMiPUUhrTfvi3UYzt3y6muTAoiKjgfKo]","protocols":"@[\"/ipfs/ping/1.0.0\", \"/meshsub/1.1.0\", \"/meshsub/1.0.0\"]","observable_address":"Some(/ip4/82.248.90.211/tcp/9001)","proto_version":"Some(\"ipfs/0.1.0\")","agent_version":"Some(\"teku/teku/v0.12.14-dev-943e901c/linux-x86_64/oracle_openjdk-java-14\")"}
{"lvl":"DBG","ts":"2020-11-20 10:34:06.198+00:00","msg":"starting pubsub read loop","topics":"pubsubpeer","tid":28076,"file":"pubsubpeer.nim:96","conn":"16*KjgfKo:5fb79b9d6ff7e1542c122346","peer":"16*KjgfKo","closed":false}
{"lvl":"DBG","ts":"2020-11-20 10:34:06.796+00:00","msg":"Irrelevant peer","topics":"sync","tid":28076,"file":"sync_protocol.nim:238","peer":"16*KjgfKo","theirStatus":{"forkDigest":"979baf67","finalizedRoot":"f9425830","finalizedEpoch":2234,"headRoot":"c37f4cb5","headSlot":71569},"ourStatus":{"forkDigest":"3b088795","finalizedRoot":"00000000","finalizedEpoch":0,"headRoot":"891fcd12","headSlot":63}}
{"lvl":"WRN","ts":"2020-11-20 10:34:06.796+00:00","msg":"Is the peer still connected?","topics":"sync","tid":28076,"file":"sync_protocol.nim:241","connected":true,"peer":"16*KjgfKo"}
{"lvl":"DBG","ts":"2020-11-20 10:34:07.005+00:00","msg":"Irrelevant peer","topics":"sync","tid":28076,"file":"sync_protocol.nim:238","peer":"16*KjgfKo","theirStatus":{"forkDigest":"979baf67","finalizedRoot":"f9425830","finalizedEpoch":2234,"headRoot":"c37f4cb5","headSlot":71569},"ourStatus":{"forkDigest":"3b088795","finalizedRoot":"00000000","finalizedEpoch":0,"headRoot":"891fcd12","headSlot":63}}
{"lvl":"WRN","ts":"2020-11-20 10:34:07.005+00:00","msg":"Is the peer still connected?","topics":"sync","tid":28076,"file":"sync_protocol.nim:241","connected":true,"peer":"16*KjgfKo"}
{"lvl":"DBG","ts":"2020-11-20 10:34:07.052+00:00","msg":"Peer's syncing status","topics":"beacnde","tid":28076,"file":"sync_manager.nim:741","wall_clock_slot":13970,"remote_head_slot":71568,"local_head_slot":63,"peer_score":50,"peer":"16*KjgfKo","index":10,"peer_speed":578.0}
{"lvl":"WRN","ts":"2020-11-20 10:34:07.053+00:00","msg":"Local timer is broken or peer's status information is invalid","topics":"beacnde","tid":28076,"file":"sync_manager.nim:748","wall_clock_slot":13970,"remote_head_slot":71568,"local_head_slot":63,"peer":"16*KjgfKo","index":10,"tolerance_value":0,"peer_speed":578.0,"peer_score":50}
{"lvl":"DBG","ts":"2020-11-20 10:34:07.053+00:00","msg":"Peer's syncing status","topics":"beacnde","tid":28076,"file":"sync_manager.nim:741","wall_clock_slot":13970,"remote_head_slot":71568,"local_head_slot":63,"peer_score":50,"peer":"16*KjgfKo","index":10,"peer_speed":578.0}
{"lvl":"WRN","ts":"2020-11-20 10:34:07.053+00:00","msg":"Local timer is broken or peer's status information is invalid","topics":"beacnde","tid":28076,"file":"sync_manager.nim:748","wall_clock_slot":13970,"remote_head_slot":71568,"local_head_slot":63,"peer":"16*KjgfKo","index":10,"tolerance_value":0,"peer_speed":578.0,"peer_score":50}
So the assumption that we hit this issue because an invalid peer connects to us and doesn't get properly disconnected seems to hold. @dryajov
From placing further logs in https://github.com/status-im/nimbus-eth2/blob/master/beacon_chain/eth2_network.nim#L424, I learned that multiple disconnects are going on (from what I see, probably because the peer reconnects right after the first disconnect?). However, because the peer is already in the Disconnecting or Disconnected state, it doesn't run the switch disconnect. I'm not fully sure of this; I'd have to debug further.
However, I noticed fixes were done here https://github.com/status-im/nimbus-eth2/pull/2065, so I'll retest that first.
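To make the suspected race concrete, here is a hedged sketch with invented names; the real logic lives around the linked eth2_network.nim line and may well look different.

# Sketch of the suspected race: if the peer reconnects while the old peer object
# is still in the Disconnecting or Disconnected state, the early return below
# skips the switch-level disconnect, leaving the new connection alive.
type
  ConnectionState = enum
    Connected, Disconnecting, Disconnected

proc dropSwitchConnection() =
  ## Stand-in for the actual libp2p switch disconnect.
  echo "switch-level connection dropped"

proc disconnectPeer(state: var ConnectionState) =
  if state in {Disconnecting, Disconnected}:
    return  # a repeat disconnect attempt bails out here without touching the switch
  state = Disconnecting
  dropSwitchConnection()
  state = Disconnected

when isMainModule:
  var state = Disconnected   # left over from the first disconnect
  disconnectPeer(state)      # the repeat disconnect returns early; nothing is dropped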
Aside from the issues with disconnect races, I do wonder about the following: even if disconnect were working correctly, a malicious peer could give correct information in the status/handshake message and then continuously feed bogus data, much like this peer from another network does. It looks like such a peer would not get disconnected via a decreasing peer score (I don't see that happen in the logs). I understand it is difficult, as it might also be a clock issue on our own side, but something should be done here so the node doesn't get overwhelmed.
While the original cause of this issue has been resolved (connection to a peer from another network), I'm keeping this issue open as the actual flooding is not resolved.
This is also shown by the report in https://github.com/status-im/nimbus-eth2/issues/2218, which suggests investigating the current tolerance value.
Additionally, I'd still suggest decreasing the peer score of such peers and eventually disconnecting them, unless someone has better ideas.
If it was indeed the remote peer's clock that was wrong: great, disconnecting was the correct action.
If it is the local node's clock that is wrong: a WRN or ERR message gets logged (until the disconnect). If the clock stays wrong, other peers will also be disconnected, and eventually few or no peers will be available. That should trigger alarms anyhow, and the logs would clearly indicate the reason, so a user can notice and debug the clock sync issue. If the clock stays out of sync, the node can't operate properly anyway, so it is better to notice this quickly.
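A minimal sketch of that descoring idea, with made-up penalty and threshold values; this is not the actual sync_manager.nim code.

# Instead of only logging the warning, penalize the peer's score each time its
# claimed head slot is impossibly far ahead, so a peer repeatedly feeding bogus
# status data is eventually disconnected. Constants and proc are hypothetical.
const
  InvalidStatusPenalty = -20   # made-up penalty per bogus status report
  PeerScoreLowLimit = 0        # made-up kick threshold

proc penalizeInvalidStatus(peerScore: var int): bool =
  ## Returns true once the peer should be disconnected.
  peerScore += InvalidStatusPenalty
  peerScore <= PeerScoreLowLimit

when isMainModule:
  var score = 50               # peer_score seen in the logs above
  while not penalizeInvalidStatus(score):
    discard                    # keep receiving bogus status messages
  echo "peer descored to ", score, ", disconnecting"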
Is this still happening?
This was addressed in 118840d2410d68b475365ed01a404fbb6c15b607, which descores such peers, eventually disconnecting them.