[Networking] Fix inconsistent API in PeerMetadataStorage.
`PeerMetadataStorage` is the single source of truth for peer information offered to applications by the networking stack. It provides applications with the ability to inspect the state of peers in the network and enables many new use-cases, e.g., cross-application peer reputation, peer topological awareness, peer prioritization and tracking. It is currently only used by state sync v2 and the (new!) network monitoring service (https://github.com/aptos-labs/aptos-core/pull/1079), but we'll eventually want to migrate all applications across to it.
However, the API is currently broken/inconsistent. For each peer it provides a `PeerState`, including `Connected`, `Disconnected` and `Disconnecting` (https://github.com/aptos-labs/aptos-core/blob/b1601d23fb157fee4c4932cd093d3809be8094a4/network/src/application/types.rs#L41-L40). Applications use these states (alongside other `PeerMetadataStorage` information) to decide how to behave (e.g., state sync prioritizes peers based on this information). Unfortunately, these states are not accurate: peers that disconnect (e.g., due to a transient health checker failure) are removed from the `PeerMetadataStorage` entirely (as opposed to being marked as disconnected) and are only added back when they reconnect. This is confusing and should be addressed.
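To make the inconsistency concrete, here's a minimal sketch, using simplified, hypothetical names (the real definitions live under `network/src/application/` and differ in detail), of how an application reads these states today and why a removed peer looks identical to a peer that never existed:

```rust
// Illustrative stand-ins only; not the actual aptos-core types or signatures.
use std::collections::HashMap;

// Stand-in for the real peer identifier.
type PeerId = u64;

// The connection states referenced above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PeerState {
    Connected,
    Disconnecting,
    Disconnected,
}

// Simplified stand-in for PeerMetadataStorage.
struct PeerMetadataStorage {
    peers: HashMap<PeerId, PeerState>,
}

impl PeerMetadataStorage {
    // How an application might inspect a peer's state, e.g., for prioritization.
    fn peer_state(&self, peer: &PeerId) -> Option<PeerState> {
        self.peers.get(peer).copied()
    }
}

fn is_connected(storage: &PeerMetadataStorage, peer: &PeerId) -> bool {
    // Today a disconnected peer is removed entirely, so this hits the `None`
    // arm rather than an explicit `Disconnected` state.
    matches!(storage.peer_state(peer), Some(PeerState::Connected))
}
```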
We should either: (1) update the `PeerMetadataStorage` API to only expose connected peers and make it clear that peers will be arbitrarily added or removed; or (2) update the API to expose all peers (connected or disconnected) and allow applications to build against this. My preference would be for option (2), for a few reasons (a rough API sketch follows the list below):
- It allows applications to persist data across disconnections and reconnections, e.g., peer reputations, scores and topological information. Otherwise, peers can arbitrarily disconnect/reconnect to refresh/reset their state (e.g., reputations), which prevents applications from being hardened against bad/malicious peers.
- It allows the network monitoring service to better disseminate peer information throughout the network and not only monitor connected peers. This will become more and more important in a decentralized world.
- It moves us one step closer to a world where we can allow applications to provide "suggestions" to the networking stack about what to do with peers. For example, applications that care about "peer quality" can identify peers that do not behave well at the application layer and can notify the networking stack to remove/replace them.
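A rough sketch of what option (2) could look like: peers stay in the storage across disconnects, so application-owned state (e.g., reputation) survives reconnections. All names here are hypothetical, not an agreed design.

```rust
use std::collections::HashMap;

type PeerId = u64;

#[derive(Clone, Copy, PartialEq, Eq)]
enum ConnectionState {
    Connected,
    Disconnecting,
    Disconnected,
}

struct PeerEntry {
    connection_state: ConnectionState,
    // Application-owned data that should persist across reconnections.
    reputation_score: f64,
}

struct PeerMetadataStorage {
    peers: HashMap<PeerId, PeerEntry>,
}

impl PeerMetadataStorage {
    // Option (2): expose every known peer, connected or not.
    fn all_peers(&self) -> impl Iterator<Item = (&PeerId, &PeerEntry)> {
        self.peers.iter()
    }

    // Convenience view for callers that only care about connected peers.
    fn connected_peers(&self) -> impl Iterator<Item = (&PeerId, &PeerEntry)> {
        self.peers
            .iter()
            .filter(|(_, entry)| entry.connection_state == ConnectionState::Connected)
    }
}
```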
(I have a guess that this issue relates to the TODO: https://github.com/aptos-labs/aptos-core/blob/b1601d23fb157fee4c4932cd093d3809be8094a4/network/src/application/types.rs#L39 but I'm not sure 😄)
Follow-up about how I discovered this:
- Whenever the state sync poller polls a peer, it marks the peer as having an in-flight request. The poller won't poll the peer again until the in-flight request is marked as complete.
- When the poller selects the peers to poll, it uses a prioritization algorithm: priority peers get polled more often. Unfortunately, the method for determining the "priority" of a peer was based on the state of the peer in `PeerMetadataStorage`.
- If a peer disconnects during a poll, it is removed from the `PeerMetadataStorage`, and when the poll eventually times out, the prioritization algorithm cannot determine the peer's priority correctly (because the peer no longer "exists"). As a result, the poller is unable to mark the in-flight request as complete. Thus, when the peer reconnects, it is not sent a new poll because the poller thinks there is still an in-flight request, and it blocks indefinitely (sketched below).
- This was discovered in `sherry-net`. If a validator is restarted during an in-flight poll, the VFN fails to re-poll the validator (on recovery). If all validators are restarted, the VFNs can no longer see new data 😄
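A sketch of the failure mode above, with hypothetical names (not the actual poller code): the poller tracks in-flight polls per peer, but the completion path looks the peer up in `PeerMetadataStorage` to decide which in-flight set it belongs to.

```rust
use std::collections::HashSet;

type PeerId = u64;

struct Poller {
    in_flight_priority: HashSet<PeerId>,
    in_flight_regular: HashSet<PeerId>,
}

impl Poller {
    // `is_priority_peer` is derived from PeerMetadataStorage at completion time;
    // it is `None` if the peer is no longer present in the storage.
    fn mark_poll_complete(&mut self, peer: PeerId, is_priority_peer: Option<bool>) {
        match is_priority_peer {
            Some(true) => {
                self.in_flight_priority.remove(&peer);
            }
            Some(false) => {
                self.in_flight_regular.remove(&peer);
            }
            // The peer disconnected mid-poll and was dropped from the storage:
            // its priority can't be determined, neither set is cleaned up, and
            // the peer is never polled again after it reconnects.
            None => {}
        }
    }
}
```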
The PR linked above works around the issue by updating the prioritization algorithm to avoid relying on `PeerMetadataStorage` when marking polls as complete. A manual workaround (without the fix) is to restart the node.
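Continuing the sketch above, the workaround amounts to something like clearing the in-flight entry without consulting `PeerMetadataStorage` at all (again, illustrative names, not the PR's actual code):

```rust
impl Poller {
    // Completion no longer needs the peer's priority from PeerMetadataStorage:
    // remove the peer from both in-flight sets (it can be in at most one).
    fn mark_poll_complete_fixed(&mut self, peer: PeerId) {
        self.in_flight_priority.remove(&peer);
        self.in_flight_regular.remove(&peer);
    }
}
```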
This issue is stale because it has been open 45 days with no activity. Remove the `stale` label or comment - otherwise this will be closed in 15 days.
@JoshLind Are any of your recent PRs related to this issue? If so, could you link this issue in your PRs?
Aah, for this specific issue, none of my recent PRs touch it. Right now, the behaviour is exactly the same. It'll likely be a while before I tackle this 😄 But yeah, it's definitely not stale!
This issue is stale because it has been open 45 days with no activity. Remove the `stale` label or comment - otherwise this will be closed in 15 days.