Improve peer finding
- [x] Update Nabu dependency to latest version
- [x] Ensure we can keep a two digit number of peers connected and synching at all times.
Please check out #79 as well.
> Please check out #79 as well.
Issue has been diagnosed, but not fixed as it's not that high priority. Read the comment here.
> Ensure we can keep a two digit number of peers connected and synching at all times.
That's a rather bigger task, as understanding the entire abstract idea of peer finding with the Kademlia protocol and then distinguishing implementation boundaries (i.e. what does Nabu provide in terms of a "Kademlia" implementation and what's left up to us as users to decide and implement? where does the abstraction of "Kademlia" as a lower level protocol end and where do Polkadot business decisions on top of that begin?) takes time to get into. I've not reached any major revelations as to how our peer finding can be "improved" yet, as I've struggled to even isolate a minimal definition of "finding a peer" and "connection to a peer" while attempting to define a test case or at least an isolated environment (understand: what are even the minimum requirements to "be connected to a peer" so that a peer is happy and doesn't drop the connection, in terms of the Polkadot network's needs?).
On a side note, whilst digging into our fork of Nabu I stumbled upon this seemingly unused boolean flag. Its original intention (a long time ago) was to make our node a client (not a server) peer, in Kademlia protocol terms. However, I suspect jvm-libp2p already handles that based on whether you're the initiator or the receiver at the end of a P2PChannel (reason: this doc comment + use case within Nabu). Either that, or there is missing logic on our side (in Nabu or in Fruzhin) to implement this role diversification. For comparison, this is how the equivalent in Go handles this.
I'm not sure, but this is a sidetrack nonetheless.
Work can be picked up on this branch.
Useful reference materials:
- The abstract libp2p spec: https://github.com/libp2p/specs/blob/master/kad-dht/README.md#libp2p-kademlia-dht-specification
- The JVM implementation we're using (transitively through Nabu): https://github.com/libp2p/jvm-libp2p/tree/9fd77410345f72bb1c327ef8d09a9f3ed41d15be
- The exact current source of Nabu used in Fruzhin can be found from the dependency declaration: `implementation("com.github.LimeChain:nabu:16c6586")`
- The Kademlia whitepaper (extremely useful for grasping the abstract overall concepts behind the protocol). Keep in mind that each implementation has its variations; on top of that, the Polkadot spec introduces some slight variations on top of Kademlia, so this whitepaper is not a 1:1 description of our exact use case within Fruzhin.
> where does the abstraction of "Kademlia" as a lower level protocol end and where do Polkadot business decisions on top of that begin?) takes time to get into
After looking into the topic I can conclude 2 things:
- Kademlia takes care of the initial "discovering" of new peers via its "distance between peers" logic for finding the closest peers and opening connections to them.
- Polkadot's logic is strictly related to all the substreams being opened. Think about block announce, grandpa, beefy, etc. This is the process that follows Kademlia's discovery. If we do not initiate a handshake on the notification protocol, the newly discovered peers remain "inactive" in the sense of Polkadot.
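To make the "distance between peers" idea a bit more concrete, here is a minimal sketch of the XOR metric Kademlia uses to rank peers. It is not how Nabu or jvm-libp2p implement it: the class and method names are made up for illustration, and peer IDs are plain strings here, whereas real libp2p peer IDs are multihash bytes. The only parts taken from the kad-dht spec are hashing into a 256-bit keyspace with SHA-256 and comparing keys by XOR.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Nabu, jvm-libp2p or Fruzhin.
final class XorDistanceSketch {

    // Map an identifier into the 256-bit keyspace via SHA-256.
    // Assumption: IDs are strings here purely for readability.
    static byte[] key(String peerId) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(peerId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Kademlia distance: bitwise XOR of the two keys, read as an unsigned integer.
    static BigInteger distance(byte[] a, byte[] b) {
        byte[] xor = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            xor[i] = (byte) (a[i] ^ b[i]);
        }
        return new BigInteger(1, xor);
    }

    // Return up to k candidates ordered by increasing distance to the target.
    static List<String> closest(String target, List<String> candidates, int k) {
        byte[] targetKey = key(target);
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparing((String id) -> distance(key(id), targetKey)));
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }
}
```

Calling `XorDistanceSketch.closest(ourId, knownPeers, 20)` would give the 20 known peers nearest to us in the keyspace. Everything Polkadot-specific (notification handshakes, substreams) happens only after this step.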
> I've struggled to even isolate a minimal definition of "finding a peer" and "connection to a peer"
- "finding a peer" -> the Kademlia implementation takes care of that for us. It finds the closest peers, connects and opens up the `kad` and `ping` protocols.
- "connection to a peer" -> this depends on the context. If we view it from libp2p's point of view, the previous bullet point explains it. However, from Polkadot's point of view we are responsible for sending a handshake to "connect" to the specific protocol substreams and start communicating (e.g. block announce).
> what are even the minimum requirements to "be connected to a peer" so that a peer is happy and doesn't drop the connection
Dropping the connection is usually related to having a low "reputation". The reputation is lowered when we, for example, send too many catch-up requests or send commit messages that are too far in the past or the future relative to the peer's state. As of this message, Fruzhin does not have a reputation implementation.
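For illustration only, this is roughly what a per-peer reputation tracker could look like. It is a sketch under assumptions, not something taken from the Polkadot spec or from any existing client: the class name, the threshold value and the delta values are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; Fruzhin has no reputation implementation as of this comment.
final class PeerReputationSketch {

    // Assumed cut-off chosen only for the example.
    static final int DISCONNECT_THRESHOLD = -100;

    private final Map<String, Integer> scores = new ConcurrentHashMap<>();

    // Apply a positive or negative change and return the peer's new score.
    int report(String peerId, int delta) {
        return scores.merge(peerId, delta, Integer::sum);
    }

    // Peers we have never reported on start at a neutral score of 0.
    int scoreOf(String peerId) {
        return scores.getOrDefault(peerId, 0);
    }

    // A peer whose score has fallen to the threshold or below gets dropped.
    boolean shouldDisconnect(String peerId) {
        return scoreOf(peerId) <= DISCONNECT_THRESHOLD;
    }
}
```

The remote peers presumably keep a similar score about us, which is why repeated catch-up requests or out-of-range commit messages can end with them closing the connection.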
While debugging I also found that some peers send quite a lot of "stream closing" signals. I have yet to understand why. The strange thing is that after doing that they send us a handshake to reopen the stream. The main concern that I have is that we might not be differentiating the "initiator" and "responder" in a correct manner.
A few issues arose from the recent changes:
- Our libp2p library seems to reject ping requests when handling too many ping streams. It is probably related to the implementation of `Ping.kt`. It uses a `var timeoutScheduler by lazyVar { Executors.newSingleThreadScheduledExecutor() }`, which rejects ping requests after some amount.
- Fruzhin tends to use libp2p's default thread pool (see `workerGroup` in https://github.com/libp2p/jvm-libp2p/blob/develop/libp2p/src/main/kotlin/io/libp2p/transport/implementation/NettyTransport.kt) for some blocking operations. That results in thread starvation and, eventually, a deadlock if we connect to too many peers.
- Logging is a mess with too many peers.
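One possible direction for the thread starvation issue is to move blocking work off the transport's worker threads and onto a dedicated pool. The sketch below only shows the general pattern with plain `java.util.concurrent`. The class name and pool size are assumptions, and how this would be wired into Fruzhin's handlers is not covered.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical helper, not existing Fruzhin code.
final class BlockingOffloadSketch {

    // Dedicated pool so blocking tasks never occupy the networking threads.
    private final ExecutorService blockingPool;

    BlockingOffloadSketch(int threads) {
        this.blockingPool = Executors.newFixedThreadPool(threads);
    }

    // Run the blocking task on the dedicated pool; the caller gets a future
    // immediately and can attach callbacks instead of waiting on it.
    <T> CompletableFuture<T> offload(Supplier<T> blockingTask) {
        return CompletableFuture.supplyAsync(blockingTask, blockingPool);
    }

    // Pool threads are non-daemon, so shut the pool down on node shutdown.
    void shutdown() {
        blockingPool.shutdown();
    }
}
```

A handler would then do something like `offload(() -> loadBlockFromDb(hash)).thenAccept(this::respond)` (both method names are made up) rather than calling the blocking code directly on the thread that received the message.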