Review the Mutex usage on PeerManager's peers
We currently use a `Mutex` to protect `PeerManager`'s `peers` HashMap:
https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L159
There are two places that I think could be made read-only:
- on `fn peers`, I'm pretty sure that it can be made read-only (see the sketch after this list):
  https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L319-L339
- on the `peer_check_interval`: it holds the lock while iterating the whole HashMap, twice, which might take quite long for public nodes:
  https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L404-L422
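A minimal sketch of what the read-only path could look like, assuming the field is switched from `Mutex` to `std::sync::RwLock`; the `PeerInfo` fields and the `SocketAddr` key here are placeholders, not the actual mycelium types:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::{Arc, RwLock};

// Placeholder peer bookkeeping; the real PeerManager stores more state.
struct PeerInfo {
    alive: bool,
}

struct PeerManager {
    // RwLock instead of Mutex, so read-only paths don't need exclusive access.
    peers: Arc<RwLock<HashMap<SocketAddr, PeerInfo>>>,
}

impl PeerManager {
    /// Read-only listing of known peers: only the read half of the lock is
    /// taken, so this can run concurrently with other readers.
    fn peers(&self) -> Vec<SocketAddr> {
        self.peers
            .read()
            .expect("lock is not poisoned")
            .keys()
            .copied()
            .collect()
    }
}
```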
For the `peer_check_interval` case, we can't simply swap the lock for a read lock with the current code; it has to be restructured. For example, this line:

```rust
self.peers.lock().unwrap().retain(|_, v| v.pt != PeerType::Inbound || v.pr.alive());
```

could be refactored to:
```rust
// First pass: take the read lock and collect the peers that should be removed.
let peers = self.peers.read().unwrap();
let mut to_remove = Vec::new();
for (addr, peer) in peers.iter() {
    if peer.pt == PeerType::Inbound && !peer.pr.alive() {
        to_remove.push(*addr);
    }
}
drop(peers);

// Second pass: take the write lock only to remove the collected peers.
let mut peers = self.peers.write().unwrap();
for addr in to_remove {
    peers.remove(&addr);
}
```
The same restructuring applies to the second lock / iteration.
I wondered about the same thing for the router, but it looks like it is not significant:
- https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L975
- https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L1125
On the `peer_check_interval`: I also wonder why we have separate dead peer checks in the PeerManager and the Router. Maybe we could merge the PeerManager's `peer_check_interval` with the Router's `check_for_dead_peers`:
https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L404
https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/router.rs#L282
The reason for a mutex is basically that this field is accessed only very infrequently outside of the peer manager's peer check loop, and for simplicity. Even for the public nodes right now, this is not an issue. I also don't think changing to an `RwLock` would solve a performance issue here, if there ever is one, compared to a mutex. The main issue would be contention, and `RwLock`s also behave poorly under contention.
On another note, we'll increase the delay for scanning for dead peers in the peer manager in the future.
> I also wonder why we have separate dead peer checks in the PeerManager and the Router. Maybe we could merge the PeerManager's `peer_check_interval` with the Router's `check_for_dead_peers`.
This is a historic thing; however, the router uses a different condition to determine whether a peer is dead. In the router, we check the time since we last received an IHU, whereas the peer_manager is concerned with the liveness of the actual connection. Therefore, to maintain some separation of concerns, they each have their own task.
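Schematically, the two checks look at different signals; this is just an illustrative sketch with made-up names, not the actual mycelium code:

```rust
use std::time::{Duration, Instant};

// Router view: a peer is dead when we haven't received an IHU from it recently.
fn router_considers_dead(last_ihu: Instant, ihu_timeout: Duration) -> bool {
    last_ihu.elapsed() > ihu_timeout
}

// PeerManager view: a peer is dead when the underlying connection itself is
// gone, regardless of what the routing protocol has seen.
fn peer_manager_considers_dead(connection_alive: bool) -> bool {
    !connection_alive
}
```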
In theory, the router could always know about dead peers, so there could be a channel for the router to notify the peer_manager, though for now this is sufficient. In that scenario, the peer_manager would also need to add some timeout before reconnecting, as we don't want a hot reconnect loop to a potentially malfunctioning peer.
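For illustration, a rough sketch of such a notification path, assuming a tokio mpsc channel; the `DeadPeer` type, the receiver loop, and the 30 second backoff are made up for this example and are not the actual mycelium API:

```rust
use std::net::SocketAddr;
use std::time::Duration;
use tokio::sync::mpsc;

// Hypothetical message the router would send when it declares a peer dead.
struct DeadPeer {
    endpoint: SocketAddr,
}

// Hypothetical peer_manager side: wait for notifications, back off, reconnect.
async fn handle_dead_peers(mut dead_peers: mpsc::Receiver<DeadPeer>) {
    while let Some(dead) = dead_peers.recv().await {
        // Back off before reconnecting so a malfunctioning peer does not
        // cause a hot reconnect loop.
        tokio::time::sleep(Duration::from_secs(30)).await;
        // A real implementation would attempt to re-establish the connection here.
        println!("reconnecting to {}", dead.endpoint);
    }
}
```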