Use peer ratings to determine if peer should be disconnected
We currently maintain a rating per peer (Peer::m_rating) which starts at 0 and is adjusted in response to activity with that peer. We make positive adjustments if the peer provides value during syncing and negative adjustments if the peer doesn't provide value . For example, we add 100 to a peer's rating if we successfully import one of their transactions and subtract 100 if their transaction is malformed: https://github.com/ethereum/aleth/blob/71124053cca611a7d3f3a3c987edb45a606a2955/libethereum/EthereumCapability.cpp#L606-L620
We currently only use this rating to determine the order in which we iterate over peers when syncing - from Host::foreEachPeer: https://github.com/ethereum/aleth/blob/71124053cca611a7d3f3a3c987edb45a606a2955/libp2p/Host.cpp#L1097-L1104
We should also use this rating to determine if we should disconnect from a peer.
Note that we also use the rating updates to determine how long we should wait until we reconnect to a peer that we were disconnected from.
Here's the logic that Peer uses to determine if we should reconnect to the associated node: https://github.com/ethereum/aleth/blob/71124053cca611a7d3f3a3c987edb45a606a2955/libp2p/Peer.cpp#L46-L49
Peer::fallbackSeconds() takes into account m_failedAttempts when computing the amount of time to wait: https://github.com/ethereum/aleth/blob/71124053cca611a7d3f3a3c987edb45a606a2955/libp2p/Peer.cpp#L51-L73
m_failedAttempts is reset to 0 in Peer::noteSessionGood()...Peer::noteSessionGood() is called on a peer rating update when the update amount is >= 0: https://github.com/ethereum/aleth/blob/71124053cca611a7d3f3a3c987edb45a606a2955/libp2p/Session.cpp#L87-L96
We saw already that we sometimes keep asking the peer for the same block many times despite its NoMoreHeaders answer. Its rating is already decreased in this case, but it continues indefinitely, because low rating never leads to disconnect.
I've just tried to sync Ropsten with your latest changes (making bootnodes not required and removing useless peers), and can confirm that it's still the case.
DEBUG 06-18 15:33:56 p2p peer Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:56 p2p sync BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…
DEBUG 06-18 15:33:56 p2p peer Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:56 p2p sync BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…
DEBUG 06-18 15:33:56 p2p peer Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:57 p2p sync BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…
DEBUG 06-18 15:33:57 p2p peer Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:58 p2p sync BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…
DEBUG 06-18 15:33:58 p2p peer Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:58 p2p sync BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…
DEBUG 06-18 15:33:58 p2p peer Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:58 p2p sync BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…
This is Parity bootnode and I guess we connect to it once and then never disconnect.
Also from the perspective of the user with default verbosity this looks weird, because it reports having 1 active peer, but the sync doesn't progress.
@gumb0 Yea, what you're seeing makes sense since we can successfully connect and peer with the Parity bootnode (my changes won't have any affect given that they only impact nodes that we can't successfully connect to or which get disconnected from).
I think I'm going to take a look at what other clients do with respect to disconnecting from nodes which aren't proving useful during sync (e.g. aren't providing us with blocks, providing us with the wrong blocks) so I can come up with a plan for when we should disconnect from these nodes.
I've taken a look at the Geth and Parity code - long story short, neither of them have a reputation system and simply disconnect peers which misbehave (and Parity has a timer which prevents itself from reconnecting to a misbehaving peer for up to 5 minutes).
Here are the details of my Parity investigation:
Parity seems to do one of two things to a badly behaving peer - disconnect or disable it. Disconnecting a peer involves sending a p2p capability disconnect packet to it whereas disabling a peer involves both disconnecting from the peer and marking it as useless (which prevents a reconnection for up to 5 minutes). I asked the Parity folks why they didn't disable peers permanently since some of the disable cases seem critical (e.g. peer is on different network) and wouldn't resolve in 5 minutes and they said that it's difficult to tell the difference between a malicious node and one that's on the wrong fork or just out-of-sync, which makes sense. Additionally, in practice it should take longer than 5 minutes before reconnecting to a peer since the disabled peer is put back into a pool of potential peers to reconnect to.
Disconnect / disable maps to a concept of punishments, where a punishment is determined based on the type of error which occurred (e.g. wrong headers, protocol version mismatch). There are currently only two punishments - disconnect or disable (but additional punishments could be added as a part of a reputation system in the future).
Details
Disabling a peer (disable_peer) and disconnecting a peer (disconnect_peer) are implemented here: https://github.com/paritytech/parity-ethereum/blob/5dc5be1e58a2585d3d32f6dd4453789c605e3a00/util/network-devp2p/src/host.rs#L1133-L1142
Note that disable_peer also marks the node as useless by inserting its id into useless_nodes - the contents of useless_nodes is used as a filter when retrieving nodes via ordered: https://github.com/paritytech/parity-ethereum/blob/fe7bc545bf88351314a0c23c33568f42ea73539d/util/network-devp2p/src/node_table.rs#L323-L331
useless_nodes is cleared via clear_useless which is called in timeout here: https://github.com/paritytech/parity-ethereum/blob/fe7bc545bf88351314a0c23c33568f42ea73539d/util/network-devp2p/src/host.rs#L1067-L1070
Note that the node table timer is registered here: https://github.com/paritytech/parity-ethereum/blob/fe7bc545bf88351314a0c23c33568f42ea73539d/util/network-devp2p/src/host.rs#L504
NODE_TABLE_TIMEOUT is set to 5 minutes here: https://github.com/paritytech/parity-ethereum/blob/fe7bc545bf88351314a0c23c33568f42ea73539d/util/network-devp2p/src/host.rs#L89
Punishments are defined here: https://github.com/paritytech/parity-ethereum/blob/5dc5be1e58a2585d3d32f6dd4453789c605e3a00/ethcore/light/src/net/error.rs#L23-L36
Punishments are mapped to disable / disconnect behavior here: https://github.com/paritytech/parity-ethereum/blob/5dc5be1e58a2585d3d32f6dd4453789c605e3a00/ethcore/light/src/net/mod.rs#L1148-L1161
The punishment types map to specific error codes here: https://github.com/paritytech/parity-ethereum/blob/5dc5be1e58a2585d3d32f6dd4453789c605e3a00/ethcore/light/src/net/error.rs#L71-L91
Here are the details of my Geth investigation - Geth manages misbehaving peers by disconnecting from them via dropPeer - the relevant Geth component is the downloader which appears to be equivalent to Aleth's BlockChainSync class. Here's where the downloader defines the dropPeer callback: https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/downloader/downloader.go#L123-L124
This callback (ProtocolManager::removePeer) is passed into the downloader ctor here: https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/handler.go#L198
Here's the definition of ProtocolManager::removePeer: https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/handler.go#L238-L255
Note that UnregisterPeer removes the node with the id from the downloader and peerset so that it will no longer be used in syncing.
Here are some places where dropPeer is called in the downloader when syncing:
- When syncing our local blockchain with a remote peer: https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/downloader/downloader.go#L323-L338
- In the case of a timeout during
fetchHeaders: https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/downloader/downloader.go#L1047-L1057 - In the case of a timeout during
fetchParts(presumably downloading headers and bodies): https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/downloader/downloader.go#L1257-L1292