go-libp2p-pubsub
go-libp2p-pubsub copied to clipboard
Protocol layer-aware pubsub
Related to https://github.com/libp2p/go-libp2p-core/pull/99 the current gossip implementation has some issues for Prysm.
Firstly, the peers in the mesh are selected from all connected peers. This does not take in to account the fact that some peers may be considered bad at the protocol layer. Ideally, we would be able to inform gossipsub about the protocol layer's view of the world. To give an example how this may work, gossipsub could have two functions, AddPeer(peer.ID) and RmPeer(peer.ID) that would add and remove peers from a "protocol-ready" list from which gossipsub could build its mesh.
Secondly, messages are forwarded on as soon as they pass the network layer validation. This bypasses the protocol layer and so can forward on messages that pass network layer validation but fail protocol layer validation, resulting in lots of unnecessary work by nodes. In this situation there could be some way of holding off on forwarding a message before it has been validated at the protocol level, or perhaps even just terminating each message when it is received and letting the protocol layer pass the message back to be forwarded out to the mesh as/when it has validated it.
Note that implementation of the first item may well significantly reduce the impact of the second item, as it means that all peers in the mesh will have passed at least a cursory handshake at the protocol layer and so invalid protocol-layer messages would be expected to be significantly lower.
cc @nisdas from Prysm for any further comment.
We are working on adding a peer scoring function to gossipsub, such that peers can be assigned some score when the mesh is full and need to prune.
Re: validation: the validator should do the necessary protocol layer validation to reject bad messages, that's part of the design. They can be asynchronous, they don't have to return at once.
Also note that the connection gating you are considering might be detrimental for pubsub.
Note that the peer scoring function can also be utilized when first constructing the mesh so that we build a good mesh to begin with.
Thanks for the information. For my own understanding, if I had a score function that looked like this:
func Score(pid peer.ID) int {
if !Connected(pid) {
return -999
} else {
return PeerScore(pid)
}
}
where Connected and PeerScore were our protocol-level functions, would this basically allow us to meet the requirements as follows:
- do not use peers that are not connected at the protocol level
- preferentially use peers that are shown to provide good and timely information at the protocol level
And related, how would this handle our change of peer status? For example, when we first connect to a peer we are unconnected at the protocol level: it may take a few seconds for our handshake process to complete. If the mesh creation algorithm asked us for a score and we gave it as -999 due to not being connected, then a few seconds finished our handshake and are now connected, how can we let gossipsub know that this peer should now be considered for inclusion in the mesh?
And yes, there are various moving parts with gating and this (and possibly other pieces). Would it be more helpful if we tried to lay out how we would want the network to behave and used that as the basis to decide how to meet that requirement rather than focus on individual, and potentially conflicting, technical changes in each package?
The mesh should adapt organically to score changes: as peers come and go the mesh expands and contracts. The scoring will come into play when decisions to graft or prune peers need to be made. So it should adapt to score changes as needed.
In order to deal with the delay in obtaining an initial peer score from the protocol handshake, we might have to add some mechanism to delay the construction of the mesh until a user callback. The only thing we have currently is the heartbeat initial delay, which is a variable that can be set to control how long before the mesh construction happens in a new peer.
As for the overall requirements for how you want the network to work, yes it would be helpful to have a coherent high level approach laid out.
Actually, regarding the score delay, you might simply want to delay joining your topics until you have finished the handshake and established a score.
@vyzo please take a look at https://hackmd.io/@jgm/SJGg_VLCS it attempts to explain some of the issues we're seeing and suggests how a solution may be approached. I'm afraid I'm no libp2p expert so it may be that some of these can be done already and I'm just unaware of them, in which case I'll apologise in advance. Please let me know your thoughts on this, and let me know if you would like me to expand on any of the items in the doc.
@vyzo @Stebalien any thoughts on the document at https://hackmd.io/@jgm/SJGg_VLCS ? Can these features be achieved with the current codebase, and if not is this something that can be provided with libp2p?
To give an example of why this is becoming a necessary feature, here is a 1-second chunk of network logs from prysm:
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmQQBSMDZto6Eh3FQHHEnA22cmPysj16z18zZX6yURWJ4o","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAm1kEiusD2sM9m3mqJfmsZkqB5Gtc1KeNSrXi8dzkUGw5Y","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmJNZnTxJ2U3aHr8gzLb4xZnHmVXUPTNkQQMuFcP9RPg11","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmJNZnTxJ2U3aHr8gzLb4xZnHmVXUPTNkQQMuFcP9RPg11","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmJNZnTxJ2U3aHr8gzLb4xZnHmVXUPTNkQQMuFcP9RPg11","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmJNZnTxJ2U3aHr8gzLb4xZnHmVXUPTNkQQMuFcP9RPg11","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmJNZnTxJ2U3aHr8gzLb4xZnHmVXUPTNkQQMuFcP9RPg11","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmLzCWAHS2bgfkxVeJuaidCrY2MtEMC7isCbWXH9gwdsmK","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAm1kEiusD2sM9m3mqJfmsZkqB5Gtc1KeNSrXi8dzkUGw5Y","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmJNZnTxJ2U3aHr8gzLb4xZnHmVXUPTNkQQMuFcP9RPg11","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"active":53,"level":"info","msg":"Peer connected","peer":"16Uiu2HAkvLh5DSwvbPmMj6qUUicSpQd4yzv5pKqBJWCAwCXLkZt5","prefix":"p2p","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmJNZnTxJ2U3aHr8gzLb4xZnHmVXUPTNkQQMuFcP9RPg11","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmJNZnTxJ2U3aHr8gzLb4xZnHmVXUPTNkQQMuFcP9RPg11","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAm1Cj5uEQCLxagj8d4K3ruxBG5vPa7N7TVFwy21PrYA7d2","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmQQBSMDZto6Eh3FQHHEnA22cmPysj16z18zZX6yURWJ4o","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAm9ZActvKa52vhznm5Z6buGtAAv1oU7mmdNkPd3fg3HQ6t","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmJNZnTxJ2U3aHr8gzLb4xZnHmVXUPTNkQQMuFcP9RPg11","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmJNZnTxJ2U3aHr8gzLb4xZnHmVXUPTNkQQMuFcP9RPg11","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmNjeovbzarwEiwhnsQnmAApeTCorAoYpP2FdXRabsH8eQ","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmNjeovbzarwEiwhnsQnmAApeTCorAoYpP2FdXRabsH8eQ","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAm1kEiusD2sM9m3mqJfmsZkqB5Gtc1KeNSrXi8dzkUGw5Y","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAm7vtg2SCkRa13yBPEkvduytKtLUC1YvVZL6qgXXXZLwMM","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
{"level":"debug","msg":"Ignoring connection request","peer":"16Uiu2HAmNjeovbzarwEiwhnsQnmAApeTCorAoYpP2FdXRabsH8eQ","prefix":"p2p","reason":"bad peer","time":"2020-01-17T09:35:26+01:00"}
This is on a relatively small (~100-node) network. Being able to shut these connections down earlier, or even better not opening them in the first place, would help us out.
Also CCing @raulk for input.
@mcdee it looks like your immediate problem is dealing with bad nodes. Could you use go-libp2p-pubsub's blacklist functionality here to blacklist nodes that fail at the application level?
This won't help when trying to figure out the "optimal" peer to connect to, but it may help in the meanwhile.
A couple caveats:
- If peers can switch from being bad to good then go-libp2p-pubsub would need an
UnblacklistPeerfunction - The blacklisting function is for a peer not a peer-topic pairing, so if you blacklist a peer you blacklist it for all topics. This may or may not be helpful to you.
@aschmahmann thank you for the suggestion. Unfortunately it appears that blacklsting the peer in pubsub doesn't stop it from continuing to attempt to talk to us; we continue to see the peer connecting to us and we have to keep disconnecting it.
From what I understand, this would reduce the pubsub messages we obtain from these peers but then we would need an UnblacklistPeer() as we decay bad peers back to (relatively) good over time.
It may be possible to make our own implementation of the Blacklist interface that was backed with our application-layer data structures, so that Contains() returned the correct information taking decay in to account. Is the user the only entity that can blacklist peers, or are there any automated systems within libp2p that may send BlacklistPeer() to us if we did this?
I think our ideal remains one or more interfaces the application layer can implement to control connections as per https://hackmd.io/@jgm/SJGg_VLCS#Suggested-approach as it would be purely informational (i.e. read-only) from the libp2p layer.
@mcdee yes if you want peers to decay back to being acceptable then you would need something like UnblacklistPeer.
I don't know of any subsystems that call BlacklistPeer, but I wouldn't count on that. Also, I'm not sure why you would need a custom Blacklist implementation when you could have your application just use events to call Blacklist (and Unblacklist) Peer. Unfortunately, IIUC creating your own custom blacklist doesn't really help with unblacklisting peers since you won't reopen a pubsub stream to that peer.
Your suggestions for more pluggability seem reasonable, I was just hoping to be able to help you progress a little faster 😃.
@mcdee take a look at the peer score integration in #263.
It allows for an application level score, which if negative will remove the peers from the mesh and only send gossip to them (if the score is not too low, see gossipThreshold).
@vyzo thank you for the link (and, of course, the PR).
From a quick read through it appears that I pass a WithPeerScore() that contains the global thresholds, and a callback that gives me a peer ID and expects a score back. A few questions:
- it appears that there are a number of components to the final score. Is there some way that we can be sure that a peer can be totally ignored, or is it a case of giving it a large negative score and assuming it will swamp the other values?
- are there any long-term impacts of providing a large negative score, i.e. is the provided score kept anywhere or is it refreshed from the callback each time it is needed?
- if we ignore a peer, will we still receive messages from it? Is this a one-way score where we don't send messages to low-value peers, or a two-way score where we neither send nor receive messages from low-value peers?
take a look at the spec in https://github.com/libp2p/specs/blob/master/pubsub/gossipsub/gossipsub-v1.1.md
- If the score of a peer drops below the
graylistThreshold, then it will be totally ignored. - It is computed live every time it is needed , so no long term impact
- It depends on how low the score is; there are 3 thresholds, as explained in the spec. Basically, if it drops below
graylistThresholdit is completely ignored; if it is above, you will receive their messages.
@mcdee Gossipsub scoring is scoped to the gossipsub level, and is used to drive mesh peer selection, i.e. which peers do I select to be part of my view of a topic mesh.
I reckon your requirement of blacklisting a peer is beyond gossipsub. Once you discover a badly behaved peer, you want to blacklist them universally within libp2p.
I think connection gating is what you're looking for: https://github.com/libp2p/go-libp2p-core/pull/99
I made a comment in the issue you mentioned, copied here:
we need to be able to close down incoming peer connections as soon as we can.
Similarly, we need to be able to inform the connection manager not to bother dialling out to a peer if we're just going to reject them when it comes to the protocol-level handshake.
What would be very useful would be to be able to have gating calls both before we dial outbound and before we accept inbound connections. These would be hooks that would receive details of the connection and would return a boolean "go/no-go" decision. We can then use the protocol-level information we have to decide if we want to complete the network-level connection and move on to the protocol-level connection.
Is this something that is going to be part of connection gating?
@mcdee I responded in libp2p/go-libp2p-core#99, to keep the convo in place. Thanks for raising here, though! I had missed your comment.