
Review the Mutex usage on PeerManager's peers

Open · iwanbk opened this issue 1 year ago • 2 comments

We currently use Mutex to protect PeerManager's peers HashMap.

https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L159
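
For context, the rough shape under discussion is sketched below. The type and field names (PeerEntry, pt, pr, the key type) are reconstructed from the snippets in this issue, not copied from peer_manager.rs:

use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::Mutex;

// Reconstructed sketch, not the actual definition: the whole peer map sits
// behind a single Mutex, so even read-only accesses take an exclusive lock.
#[derive(PartialEq)]
enum PeerType {
    Inbound,
    // other variants elided
}

struct Peer; // placeholder for the real peer/connection type
impl Peer {
    fn alive(&self) -> bool {
        true
    }
}

struct PeerEntry {
    pt: PeerType, // how the connection was established
    pr: Peer,     // connection handle exposing `alive()`
}

struct PeerManager {
    peers: Mutex<HashMap<SocketAddr, PeerEntry>>,
}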

There are two places that I think could be made read-only:

  1. On the fn peers, I'm pretty sure that we can make it read-only (a sketch of what that could look like follows this list)

https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L319-L339

  2. On the peer_check_interval

It holds the lock while iterating over the whole HashMap, twice, which might take quite a while on public nodes:

https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L404-L422
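
For the first place, continuing the sketch above with the Mutex swapped for an RwLock, the accessor would only need a shared guard. PeerStats and its constructor here are placeholders, not the actual signature in peer_manager.rs:

impl PeerManager {
    // Hypothetical read-only variant: building the peer overview never
    // mutates the map, so a read guard is enough.
    pub fn peers(&self) -> Vec<PeerStats> {
        let peers = self.peers.read().unwrap();
        peers
            .iter()
            .map(|(addr, entry)| PeerStats::new(addr, entry)) // placeholder constructor
            .collect()
    }
}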

For the second place, we can't simply replace the lock with a read in the current code; we have to restructure it.

For example, this line:

self.peers.lock().unwrap().retain(|_, v| v.pt != PeerType::Inbound || v.pr.alive());

could be refactored to:

// Scan under a shared (read) guard and only collect the keys to remove.
let peers = self.peers.read().unwrap();
let mut to_remove = Vec::new();

for (addr, peer) in peers.iter() {
    if peer.pt == PeerType::Inbound && !peer.pr.alive() {
        to_remove.push(*addr);
    }
}
// Release the read guard before taking the write guard.
drop(peers);

// Remove the collected peers under an exclusive (write) guard.
let mut peers = self.peers.write().unwrap();
for addr in to_remove {
    peers.remove(&addr);
}
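
One thing to be careful about with this two-phase approach: between dropping the read guard and taking the write guard, another task could in principle insert a fresh, live entry under one of the collected keys, and the removal loop would then drop a live peer. If that can happen here, the removal should re-check the condition under the write lock (or simply call retain on the write guard).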

The same applies to the second lock / iteration.

I wondered about the same thing for the router, but it looks like it is not significant:

  • https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L975
  • https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L1125

iwanbk · Dec 25 '24

> on the peer_check_interval

I also wonder why we have separate dead peer checks in the peer manager & router. Maybe we could merge the PeerManager's peer_check_interval with the Router's check_for_dead_peers:

https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/peer_manager.rs#L404

https://github.com/threefoldtech/mycelium/blob/74719ba7cacb96e2d9251ce57760474e6973a14e/mycelium/src/router.rs#L282

iwanbk · Dec 26 '24

The reason for a mutex is basically that this field is accessed only very infrequently outside of the peer manager's peer check loop, and for simplicity. Even for the public nodes, this is not an issue right now. I also don't think changing to an RwLock would solve a performance issue here, if there ever is one, compared to a mutex. The main issue would be contention, and RwLocks also behave poorly under contention.

On another note, we'll increase the delay for scanning for dead peers in the peer manager in the future.

> I also wonder why we have separate dead peer checks in the peer manager & router. Maybe we could merge the PeerManager's peer_check_interval with the Router's check_for_dead_peers

This is a historic thing; however, the router uses a different condition to determine whether a peer is dead. In the router, we check the time since we last received an IHU, whereas the peer_manager is concerned with the liveness of the actual connection. Therefore, to maintain some separation of concerns, they each have their own task.

In theory, the router could always know about dead peers, so there could be a channel for the router to notify the peer_manager, though for now this is sufficient. In that scenario, the peer_manager would also need to add some timeout before reconnecting, as we don't want a hot reconnect loop to a potentially malfunctioning peer.
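
A rough sketch of that notification idea, purely illustrative (the channel, message type, and backoff value are assumptions, not the actual mycelium API; it assumes the tokio runtime already used by the project):

use std::net::SocketAddr;
use std::time::Duration;
use tokio::sync::mpsc;

// Hypothetical message the router could send when it declares a peer dead.
struct DeadPeer {
    endpoint: SocketAddr, // assumed key type for a peer
}

// Hypothetical peer_manager side: wait for notifications and back off before
// reconnecting, so a malfunctioning peer is not hammered in a hot loop.
async fn handle_dead_peers(mut rx: mpsc::Receiver<DeadPeer>) {
    while let Some(dead) = rx.recv().await {
        tokio::time::sleep(Duration::from_secs(30)).await;
        // reconnect(dead.endpoint).await; // placeholder for the real reconnect logic
        let _ = dead.endpoint;
    }
}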

LeeSmet · Dec 26 '24