aleth icon indicating copy to clipboard operation
aleth copied to clipboard

Use peer ratings to determine if peer should be disconnected

Open halfalicious opened this issue 6 years ago • 5 comments

We currently maintain a rating per peer (Peer::m_rating) which starts at 0 and is adjusted in response to activity with that peer. We make positive adjustments if the peer provides value during syncing and negative adjustments if the peer doesn't provide value . For example, we add 100 to a peer's rating if we successfully import one of their transactions and subtract 100 if their transaction is malformed: https://github.com/ethereum/aleth/blob/71124053cca611a7d3f3a3c987edb45a606a2955/libethereum/EthereumCapability.cpp#L606-L620

We currently only use this rating to determine the order in which we iterate over peers when syncing - from Host::foreEachPeer: https://github.com/ethereum/aleth/blob/71124053cca611a7d3f3a3c987edb45a606a2955/libp2p/Host.cpp#L1097-L1104

We should also use this rating to determine if we should disconnect from a peer.

halfalicious avatar Apr 30 '19 03:04 halfalicious

Note that we also use the rating updates to determine how long we should wait until we reconnect to a peer that we were disconnected from.

Here's the logic that Peer uses to determine if we should reconnect to the associated node: https://github.com/ethereum/aleth/blob/71124053cca611a7d3f3a3c987edb45a606a2955/libp2p/Peer.cpp#L46-L49

Peer::fallbackSeconds() takes into account m_failedAttempts when computing the amount of time to wait: https://github.com/ethereum/aleth/blob/71124053cca611a7d3f3a3c987edb45a606a2955/libp2p/Peer.cpp#L51-L73

m_failedAttempts is reset to 0 in Peer::noteSessionGood()...Peer::noteSessionGood() is called on a peer rating update when the update amount is >= 0: https://github.com/ethereum/aleth/blob/71124053cca611a7d3f3a3c987edb45a606a2955/libp2p/Session.cpp#L87-L96

halfalicious avatar Apr 30 '19 04:04 halfalicious

We saw already that we sometimes keep asking the peer for the same block many times despite its NoMoreHeaders answer. Its rating is already decreased in this case, but it continues indefinitely, because low rating never leads to disconnect.

I've just tried to sync Ropsten with your latest changes (making bootnodes not required and removing useless peers), and can confirm that it's still the case.

DEBUG 06-18 15:33:56 p2p  peer   Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:56 p2p  sync   BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…
DEBUG 06-18 15:33:56 p2p  peer   Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:56 p2p  sync   BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…
DEBUG 06-18 15:33:56 p2p  peer   Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:57 p2p  sync   BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…
DEBUG 06-18 15:33:57 p2p  peer   Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:58 p2p  sync   BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…
DEBUG 06-18 15:33:58 p2p  peer   Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:58 p2p  sync   BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…
DEBUG 06-18 15:33:58 p2p  peer   Requesting 1024 block headers starting from 510260 from ##865a6325… 865a6325…
DEBUG 06-18 15:33:58 p2p  sync   BlocksHeaders (0 entries) : NoMoreHeaders from ##865a6325…

This is Parity bootnode and I guess we connect to it once and then never disconnect.

Also from the perspective of the user with default verbosity this looks weird, because it reports having 1 active peer, but the sync doesn't progress.

gumb0 avatar Jun 18 '19 15:06 gumb0

@gumb0 Yea, what you're seeing makes sense since we can successfully connect and peer with the Parity bootnode (my changes won't have any affect given that they only impact nodes that we can't successfully connect to or which get disconnected from).

I think I'm going to take a look at what other clients do with respect to disconnecting from nodes which aren't proving useful during sync (e.g. aren't providing us with blocks, providing us with the wrong blocks) so I can come up with a plan for when we should disconnect from these nodes.

halfalicious avatar Jun 21 '19 05:06 halfalicious

I've taken a look at the Geth and Parity code - long story short, neither of them have a reputation system and simply disconnect peers which misbehave (and Parity has a timer which prevents itself from reconnecting to a misbehaving peer for up to 5 minutes).

Here are the details of my Parity investigation:

Parity seems to do one of two things to a badly behaving peer - disconnect or disable it. Disconnecting a peer involves sending a p2p capability disconnect packet to it whereas disabling a peer involves both disconnecting from the peer and marking it as useless (which prevents a reconnection for up to 5 minutes). I asked the Parity folks why they didn't disable peers permanently since some of the disable cases seem critical (e.g. peer is on different network) and wouldn't resolve in 5 minutes and they said that it's difficult to tell the difference between a malicious node and one that's on the wrong fork or just out-of-sync, which makes sense. Additionally, in practice it should take longer than 5 minutes before reconnecting to a peer since the disabled peer is put back into a pool of potential peers to reconnect to.

Disconnect / disable maps to a concept of punishments, where a punishment is determined based on the type of error which occurred (e.g. wrong headers, protocol version mismatch). There are currently only two punishments - disconnect or disable (but additional punishments could be added as a part of a reputation system in the future).

Details Disabling a peer (disable_peer) and disconnecting a peer (disconnect_peer) are implemented here: https://github.com/paritytech/parity-ethereum/blob/5dc5be1e58a2585d3d32f6dd4453789c605e3a00/util/network-devp2p/src/host.rs#L1133-L1142

Note that disable_peer also marks the node as useless by inserting its id into useless_nodes - the contents of useless_nodes is used as a filter when retrieving nodes via ordered: https://github.com/paritytech/parity-ethereum/blob/fe7bc545bf88351314a0c23c33568f42ea73539d/util/network-devp2p/src/node_table.rs#L323-L331

useless_nodes is cleared via clear_useless which is called in timeout here: https://github.com/paritytech/parity-ethereum/blob/fe7bc545bf88351314a0c23c33568f42ea73539d/util/network-devp2p/src/host.rs#L1067-L1070

Note that the node table timer is registered here: https://github.com/paritytech/parity-ethereum/blob/fe7bc545bf88351314a0c23c33568f42ea73539d/util/network-devp2p/src/host.rs#L504

NODE_TABLE_TIMEOUT is set to 5 minutes here: https://github.com/paritytech/parity-ethereum/blob/fe7bc545bf88351314a0c23c33568f42ea73539d/util/network-devp2p/src/host.rs#L89

Punishments are defined here: https://github.com/paritytech/parity-ethereum/blob/5dc5be1e58a2585d3d32f6dd4453789c605e3a00/ethcore/light/src/net/error.rs#L23-L36

Punishments are mapped to disable / disconnect behavior here: https://github.com/paritytech/parity-ethereum/blob/5dc5be1e58a2585d3d32f6dd4453789c605e3a00/ethcore/light/src/net/mod.rs#L1148-L1161

The punishment types map to specific error codes here: https://github.com/paritytech/parity-ethereum/blob/5dc5be1e58a2585d3d32f6dd4453789c605e3a00/ethcore/light/src/net/error.rs#L71-L91

halfalicious avatar Jul 07 '19 22:07 halfalicious

Here are the details of my Geth investigation - Geth manages misbehaving peers by disconnecting from them via dropPeer - the relevant Geth component is the downloader which appears to be equivalent to Aleth's BlockChainSync class. Here's where the downloader defines the dropPeer callback: https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/downloader/downloader.go#L123-L124

This callback (ProtocolManager::removePeer) is passed into the downloader ctor here: https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/handler.go#L198

Here's the definition of ProtocolManager::removePeer: https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/handler.go#L238-L255

Note that UnregisterPeer removes the node with the id from the downloader and peerset so that it will no longer be used in syncing.

Here are some places where dropPeer is called in the downloader when syncing:

  • When syncing our local blockchain with a remote peer: https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/downloader/downloader.go#L323-L338
  • In the case of a timeout during fetchHeaders: https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/downloader/downloader.go#L1047-L1057
  • In the case of a timeout during fetchParts (presumably downloading headers and bodies): https://github.com/ethereum/go-ethereum/blob/7fd82a0e3e87aaf26198a6cf2de242c30fa7c2ab/eth/downloader/downloader.go#L1257-L1292

halfalicious avatar Jul 07 '19 22:07 halfalicious