go-spacemesh
go-spacemesh copied to clipboard
p2p: eventually evict disconnected peer from address book
Description
during longevity testing, p2p would emit error logs for nodes that we replaced during tests. an error log is printed for each disconnected peer every X seconds where X ranges from 10-90 seconds. and this is for every peer the node lost connection ever. this produced ~500 error logs every 10 min and lasted till the end of the test.
it would appear that disconnected peers are not removed from the address book.
log snippet
2022-08-01 12:40:05
{"L":"ERROR","T":"2022-08-01T15:40:05.955-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWHfVBhsnXsCEXEA8CjU1Pn8vWVDbppiKzuxRMem1s3cdu","errmsg":"no good addresses","name":"p2p"}
2022-08-01 12:40:05
{"L":"ERROR","T":"2022-08-01T15:40:05.957-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWSofbFHnoyNkmgVMTQ5bjmuNsEuaHwesFU3q2D8MqvUGp","errmsg":"failed to dial 12D3KooWSofbFHnoyNkmgVMTQ5bjmuNsEuaHwesFU3q2D8MqvUGp:\n * [/ip4/10.0.1.20/tcp/7513] dial backoff","name":"p2p"}
2022-08-01 12:40:05
{"L":"ERROR","T":"2022-08-01T15:40:05.957-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWMK8uyp8dJjXNn8mpEN3fvjoLwQSJtmv3c7xpfHfm1qWa","errmsg":"no good addresses","name":"p2p"}
2022-08-01 12:40:05
{"L":"ERROR","T":"2022-08-01T15:40:05.959-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWBtKW2hD2T2MdviXJ5qWqaX4ZbnVfLcbrJrZoF2Twm6cL","errmsg":"failed to dial 12D3KooWBtKW2hD2T2MdviXJ5qWqaX4ZbnVfLcbrJrZoF2Twm6cL:\n * [/ip4/10.0.1.11/tcp/7513] dial backoff","name":"p2p"}
2022-08-01 12:40:05
{"L":"ERROR","T":"2022-08-01T15:40:05.960-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWGXxqnC1kxzk6LSXo8KkWkM2bUUwFMETcHQBbavvqj6AY","errmsg":"failed to dial 12D3KooWGXxqnC1kxzk6LSXo8KkWkM2bUUwFMETcHQBbavvqj6AY:\n * [/ip4/10.0.1.7/tcp/7513] dial backoff","name":"p2p"}
2022-08-01 12:40:05
{"L":"ERROR","T":"2022-08-01T15:40:05.960-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWCdmeu4GwP36L9dgx15rADhANX6uR9ek8TemKXFLXY9Dw","errmsg":"failed to dial 12D3KooWCdmeu4GwP36L9dgx15rADhANX6uR9ek8TemKXFLXY9Dw:\n * [/ip4/10.0.2.23/tcp/7513] dial backoff","name":"p2p"}
2022-08-01 12:40:05
{"L":"ERROR","T":"2022-08-01T15:40:05.960-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWCThvyPGHe5mUTPWanCNRCX12ieRZiEPKF6uLYgZJYS3t","errmsg":"failed to dial 12D3KooWCThvyPGHe5mUTPWanCNRCX12ieRZiEPKF6uLYgZJYS3t:\n * [/ip4/10.0.2.9/tcp/7513] dial backoff","name":"p2p"}
2022-08-01 12:40:05
{"L":"ERROR","T":"2022-08-01T15:40:05.965-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWEXqEv6UoU2edY6gKz7ugpSy5SKXohaCeGC1FXkAXjSKW","errmsg":"failed to dial 12D3KooWEXqEv6UoU2edY6gKz7ugpSy5SKXohaCeGC1FXkAXjSKW:\n * [/ip4/10.0.2.12/tcp/7513] dial backoff","name":"p2p"}
2022-08-01 12:40:10
{"L":"ERROR","T":"2022-08-01T15:40:10.966-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWGMaAHMoVB6NjymmZnT5Sny1AF5dyqd1TYMrX8tLYZhzk","errmsg":"failed to dial 12D3KooWGMaAHMoVB6NjymmZnT5Sny1AF5dyqd1TYMrX8tLYZhzk:\n * [/ip4/10.0.1.24/tcp/7513] dial tcp4 0.0.0.0:7513->10.0.1.24:7513: i/o timeout","name":"p2p"}
2022-08-01 12:40:10
{"L":"ERROR","T":"2022-08-01T15:40:10.967-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWPXBnzFxVtRh1PfQTiisw8t7teGoTkq9Nogf9qi8Vb9F2","errmsg":"failed to dial 12D3KooWPXBnzFxVtRh1PfQTiisw8t7teGoTkq9Nogf9qi8Vb9F2:\n * [/ip4/10.0.2.19/tcp/7513] dial backoff","name":"p2p"}
2022-08-01 12:40:15
{"L":"ERROR","T":"2022-08-01T15:40:15.957-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWHfVBhsnXsCEXEA8CjU1Pn8vWVDbppiKzuxRMem1s3cdu","errmsg":"no good addresses","name":"p2p"}
2022-08-01 12:40:15
{"L":"ERROR","T":"2022-08-01T15:40:15.957-0400","N":"b4141.p2p ","M":"failed to connect to peer","node_id":"b414165507ca0823b2fd71c9bad03a15732b8c1b9db47c0a93a40f020bc53ed0","module":"p2p","peer":"12D3KooWMK8uyp8dJjXNn8mpEN3fvjoLwQSJtmv3c7xpfHfm1qWa","errmsg":"no good addresses","name":"p2p"}
screenshot at the beginning of the TestStepReplaceNodes.

screenshot at the end of the test.

works items:
- instrument address book to track how many addresses were removed (total addresses gauge, remove addresses counter, maybe a hist to track how long it takes to remove a peer after it failed connection the first time)
- based on the above metrics evaluate the efficiency of the current eviction logic using basic longevity test
- come up with a better heuristic for evicting peers that are in unknown state (e.g if a peer gave already dead addresses)
if node fails to connect in periodic check - not log.Error, just log.Info
failed to check nodes: context deadline exceeded
Summary of my observations so far:
When the bookchecker queries for peers that are unavailable for X period of time, it checks the LastSeen property of a knownAddress. This property is periodically updated by the crawler in crawler::query() when it receives a list of peers known to the other node.
When I changed the bookchecker to query unavailable peers by looking at the LastSuccess property instead, it started to evict stale peers properly. However, they are soon added back to the address book by the aforementioned logic of the crawler.
I need to think about how to avoid re-adding just evicted peers back to the address book. @dshulyak suggested that nodes should not broadcast peer addresses that are stale for some time (LastSuccess older than X time).