rust-libp2p icon indicating copy to clipboard operation
rust-libp2p copied to clipboard

feat(kad): add `refresh_interval` config used to poll `bootstrap`

Open PanGan21 opened this issue 2 years ago • 47 comments

Description

Previously, users were responsible for calling bootstrap on an interval. This was documented but hard to discover for people new to the library. To maintain healthy routing tables, it is advised to regularly call bootstrap. By default, we will now do this automatically every 5 minutes and once we add a peer to our routing table, assuming we didn't bootstrap yet. This is especially useful as part of nodes starting up and connecting to bootstrap nodes.

Closes: #4730.

Attributions

Co-authored-by: stormshield-frb [email protected]

Notes & open questions

Change checklist

  • [x] I have performed a self-review of my own code
  • [x] I have made corresponding changes to the documentation
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [x] A changelog entry has been made in the appropriate crates

PanGan21 avatar Nov 12 '23 16:11 PanGan21

Looks like a good start to me. Correct me if im wrong but should we only poll the timer if Behaviour::bootstrap is successfully called?

dariusc93 avatar Nov 13 '23 01:11 dariusc93

Correct me if im wrong but should we only poll the timer if Behaviour::bootstrap is successfully called?

Why? In order to maintain a healthy routing table, bootstrap should be called regularly. I don't think there is a reason to wait for the user call it. In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally?

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.

If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

thomaseizinger avatar Nov 13 '23 02:11 thomaseizinger

In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally?

cc @mxinden

thomaseizinger avatar Nov 13 '23 02:11 thomaseizinger

Why? In order to maintain a healthy routing table, bootstrap should be called regularly. I don't think there is a reason to wait for the user call it. In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally?

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.

If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

I do wonder if it should be done automatically instead of waiting for it to be called, given that the spec does mention doing it at the start, so maybe it could be made a configurable option, although the developer loses out on flexibility on making that call themselves. Regardless, in its current state, a successful call to Behaviour::bootstrap should, in my opinion, trigger the timer to start rather than for the node to instantly start the timer upon enabling the behaviour, especially if the node does not wish to bootstrap right at that moment (and not sure if its an actual requirement for the node to bootstrap, but maybe @mxinden can provide insight on that). It might even be better to start/reset the timer after the query itself has been completed.

dariusc93 avatar Nov 13 '23 22:11 dariusc93

Why? In order to maintain a healthy routing table, bootstrap should be called regularly. I don't think there is a reason to wait for the user call it. In fact, I am wondering if we perhaps shouldn't remove the public bootstrap method entirely in favor of the configuration option to do it internally? Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward. If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

I do wonder if it should be done automatically instead of waiting for it to be called, given that the spec does mention doing it at the start, so maybe it could be made a configurable option, although the developer loses out on flexibility on making that call themselves. Regardless, in its current state, a successful call to Behaviour::bootstrap should, in my opinion, trigger the timer to start rather than for the node to instantly start the timer upon enabling the behaviour, especially if the node does not wish to bootstrap right at that moment (and not sure if its an actual requirement for the node to bootstrap, but maybe @mxinden can provide insight on that). It might even be better to start/reset the timer after the query itself has been completed.

Those are good points. How about the following:

  • We add a new function, start_regular_bootstrap (name to be bike-shedded) that kicks off an initial bootstrap.
  • We remember the query ID of this bootstrap within the Behaviour.
  • Once if finishes, we reset the timer. We can tell that this was a "regular" bootstrap by comparing it against the stored ID. We may want to not report any events generated by this regular bootstrap.

For now, I'd leave the current bootstrap untouched but I think it is worth considering to deprecate it in favor of start_regular_bootstrap.

thomaseizinger avatar Nov 14 '23 00:11 thomaseizinger

Today users forget to (a) call bootstrapp at startup and (b) call bootstrap continuously thereafter. In my eyes in most cases one should do both (a) and (b). Thus I believe it should be the default behavior and the goal of https://github.com/libp2p/rust-libp2p/issues/4730.

  • We add a new function, start_regular_bootstrap (name to be bike-shedded) that kicks off an initial bootstrap.

That would again require users to call a method in the default case which they will likely forget.

How about triggering the first bootstrap once the first node is added?

For now, I'd leave the current bootstrap untouched

I am fine with leaving the existing fn bootstrap` untouched and allowing users to disable the automatic reoccurring interval.

mxinden avatar Nov 14 '23 20:11 mxinden

Today users forget to (a) call bootstrapp at startup and (b) call bootstrap continuously thereafter. In my eyes in most cases one should do both (a) and (b). Thus I believe it should be the default behavior and the goal of #4730.

  • We add a new function, start_regular_bootstrap (name to be bike-shedded) that kicks off an initial bootstrap.

That would again require users to call a method in the default case which they will likely forget.

How about triggering the first bootstrap once the first node is added?

For now, I'd leave the current bootstrap untouched

I am fine with leaving the existing fn bootstrap` untouched and allowing users to disable the automatic reoccurring interval.

I will emit that I would be guilty of (b), though for (a), the only times I find myself not bootstrapping is when I wish to only query peers added to the routing table manually. I take it that that (a) should be done when one peer is added regardless and not wait around until a later point? Are there any implications from not bootstrapping in the first place even when its only one peer added (with that peer not containing any other peers in its table)?

dariusc93 avatar Nov 14 '23 20:11 dariusc93

To quote @mxinden from an earlier conversation: A bootstrap is just a query. If you want to fire a query asap on startup, fire it. A bootstrap just fills up the buckets to make future queries faster. If you already have a query you want the result for, running it will give you a quicker results than first waiting for the bootstrap!

thomaseizinger avatar Nov 14 '23 21:11 thomaseizinger

Hi everyone. Sorry to barge in 3 days later, but we just saw this PR and might have some interesting thoughts to share.

We totally agree with the initial issue that it would be a great thing for the end user if the Kademlia behaviour handled bootstrapping on its own. (Why not also call it refresh, it might make more sense, but we don't have a particular fixed idea on this matter).

However, we think it might be a good idea to have a system a little bit more evolved than just bootstrapping based on an interval. Moreover, there is already in the mDNS behaviour, a system allowing to create many packets at first and gradually reaching a cruise interval (protocols/mdns/src/behaviour/iface.rs:ProbeState). Why not do something like that?

We did and are currently using a specific development to handle bootstrap in a more evolved manner, so we would be more than happy to help on this subject, and why not upstreaming it into the libp2p code base if you want.

Our bootstrap is composed of several "components" and we are not saying that everything should be integrated in the libp2p code base. The only purpose of the following description of our implementation is to be a starting point of an open discussion around this issue.

  1. our bootstrap system is based on an IncrementedStream which is almost the same thing as an exponential backoff. The only difference is that for example, between 1 and 16, increments for an exponential backoff would be 1, 2, 4, 8, 16 and that for our IncrementedStream, increments are 1, 1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 8, 8, 16 (i.e. 1(x5), 2(x4), 4(x3), 8(x2), 16(x1)). So it just means that we are multiplying the lowest increments of the exponential backoff to be able to run more often at these durations. So our bootstrap process is very frequent at peer startup and will then gradually slow down to reach the final interval (almost like the mdns ProbeState).

  2. we also integrated a system allowing us to reset this timer in the event of a huge network perturbation. On each FromSwarm::ConnectionEstablished and FromSwarm::ConnectionClosed, we store the number of connected_peers and try to determine if the most recent value of connected_peers imply an important change. To be more clear, losing connection to 4 peers does not have the same meaning nor the same impact if we previously were connected to 6 peers or 50 peers. To measure the importance of the change, we verify for how long we had a "stable network" (i.e. if the number of connected_peers was roughly stable during a specified amount of time), and then we verify if the current number of connected_peers represent an important variation based on the "average" number of connected_peers during the last N minutes. If an important change is detected, the timer is reset, and we restart our rapid bootstrap.

If you find this interesting, I will gladly elaborate and share code samples.

stormshield-frb avatar Nov 15 '23 17:11 stormshield-frb

Thank you for the input @stormshield-frb ! Whilst interesting, I am not sure we need an elaborate algorithm like this as a default.

If we keep the current bootstrap method and invoke it on a configurable timer that can be disabled, it should be easy enough to implement different bootstrap schemes if a simple timer is not sufficient. At the same time, having a timer that runs out-of-the-box makes libp2p-kad more useful for people getting started with it.

This follows the mantra of: "make the easy things easy, make the hard things possible".

Would that work for everybody?

thomaseizinger avatar Nov 16 '23 00:11 thomaseizinger

Hi @thomaseizinger, you're welcome :wink: Glad you found it interesting. Indeed, it might be too much by default, and I do agree with your mantra.

However, quoting you from an earlier comment:

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward.

If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

Doing something like it is currently done in mdns would help in this matter. We don't have to do what I described (custom exponential backoff), a simple exponential backoff like it is done in mdns with ProbeState would probably help a lot. Maybe something simple like this could be considered?

For the metric part (number of connected_peers) I will not strongly advocate for it right now because this feature is still quite new in our code base, so we don't have as much background to assert it is efficient. We will keep using this in the near future, but I totally understand your point of view. The more complex the algorithm, the more error-prone it is :joy: If we see amazing results in the next several months, we will consider upstreaming it in a new PR :wink:

I am fine with leaving the existing fn bootstrap untouched and allowing users to disable the automatic reoccurring interval.

As @mxinden said, we would gladly appreciate it if the current fn bootstrap was left untouched since it would allow, like you said, to "make hard things possible".

stormshield-frb avatar Nov 16 '23 09:11 stormshield-frb

Something that we'd have to think about is: When a node starts up, it doesn't have any connections yet. Thus, we should probably delay the very first bootstrap query until we have a connection to another DHT peer at which point we should instantly bootstrap and only perform it in an interval going forward. If we naively call it on an interval, we will execute it straight at startup where it will fail because there are no peers, resulting in a delay of 5 minutes until we will actually bootstrap.

Doing something like it is currently done in mdns would help in this matter. We don't have to do what I described (custom exponential backoff), a simple exponential backoff like it is done in mdns with ProbeState would probably help a lot. Maybe something simple like this could be considered?

Yes we definitely need to handle that. I think to start with, it is probably easiest to have a simple state variable that tracks, whether we've already successfully run at least 1 bootstrap.

Upon every new connection, we can then initiate a bootstrap if we haven't successfully bootstrapped yet (we should also track whether there is currently one in progress). It could be that the first peer we connect to does not support kademlia so we should not give up after the first peer.

Something like:

enum InitialBootstrap {
	Active(QueryId),
	Completed
}

and:

initial_bootstrap: Option<InitialBootstrap>

If the bootstrap fails, we set it back to None.

What do people think?

thomaseizinger avatar Nov 16 '23 22:11 thomaseizinger

have a simple state variable that tracks, whether we've already successfully run at least 1 bootstrap.

Absolutely. This is a good idea. We actually do that in our code base.

However, it raises the question : what should we do if the first bootstrap fail ? Should be try an other bootstrap right after ? Some time after ? What if this one fails too ?

In my opinion, this situation conforms precisely to the recommended practices of using exponential backoff. What do you think ?

stormshield-frb avatar Nov 20 '23 08:11 stormshield-frb

Should be try an other bootstrap right after ?

On the next connection yeah.

In my opinion, this situation conforms precisely to the recommended practices of using exponential backoff. What do you think ?

Exponential backoffs are useful if simply having time pass might fix the problem (e.g. a remote being overloaded). In our case, all we need to wait for is another connection that might be a kademlia node. Or, put differently, if we haven't gotten a new connection, retrying bootstrap will not make a difference and if we have a new connection, there is no reason to delay another bootstrap attempt.

Once we've had one successful bootstrap, we should be able to refresh our routing table based on the remaining entries. We could use an exponential backoff for that as it is much more likely that a bootstrap fails due to us being disconnected from the network.

I'd open to considering that but equally, I think a simple timer would also already be an improvement. Happy to go with whatever direction @mxinden and @PanGan21 as the contributor is happy with :)

For example, perhaps we can land a simple timer first and discuss an expontential backoff as an optimisation after that you @stormshield-frb could usptream?

thomaseizinger avatar Nov 21 '23 00:11 thomaseizinger

Should be try an other bootstrap right after ?

On the next connection yeah.

In my opinion, this situation conforms precisely to the recommended practices of using exponential backoff. What do you think ?

Exponential backoffs are useful if simply having time pass might fix the problem (e.g. a remote being overloaded). In our case, all we need to wait for is another connection that might be a kademlia node. Or, put differently, if we haven't gotten a new connection, retrying bootstrap will not make a difference and if we have a new connection, there is no reason to delay another bootstrap attempt.

Once we've had one successful bootstrap, we should be able to refresh our routing table based on the remaining entries. We could use an exponential backoff for that as it is much more likely that a bootstrap fails due to us being disconnected from the network.

I'd open to considering that but equally, I think a simple timer would also already be an improvement. Happy to go with whatever direction @mxinden and @PanGan21 as the contributor is happy with :)

For example, perhaps we can land a simple timer first and discuss an expontential backoff as an optimisation after that you @stormshield-frb could usptream?

I am open to continue both options (just a timer and exponential backoff). I think though it is better to split it in two issues/prs with different acceptance criteria. Of course if you have different opinion I am still available to contribute :)

PanGan21 avatar Nov 22 '23 10:11 PanGan21

Exponential backoffs are useful if simply having time pass might fix the problem (e.g. a remote being overloaded). In our case, all we need to wait for is another connection that might be a kademlia node. Or, put differently, if we haven't gotten a new connection, retrying bootstrap will not make a difference and if we have a new connection, there is no reason to delay another bootstrap attempt.

Thanks for this explanation @thomaseizinger. I was not aware of that. Given your explanation, it indeed seems that exponential backoff is not necessary the best fit here. I like your idea of boostraping on new connection.

So if I understand correctly:

  1. we bootstrap on new connections until one succeeds then go to 2
  2. we bootstrap on a timer (configured by the end user) and stays on 2 while bootstrap succeeds
  3. if bootstrap fails, we start an exponential backoff until bootstrap succeed again (so we go back to 2)

Did I correctly understood ?

stormshield-frb avatar Nov 22 '23 14:11 stormshield-frb

Yes, thank you for laying it out clearly @stormshield-frb ! As @PanGan21 said, I think it makes sense to focus this PR on (1) & (2).

Using an exponential backoff for retrying failed bootstraps after the first one is an optimisation in my view. It requires additional considerations such as inspecting the reason why the bootstrap failed, what the interval should be etc.

Based on @mxinden's comment above, I think he agrees with this proposal as well. Please go ahead with that @PanGan21 and thank you for working on this!

thomaseizinger avatar Nov 22 '23 22:11 thomaseizinger

  1. we bootstrap on new connections until one succeeds then go to 2

What I wanted to add to this is, we should only start the bootstrap iff the node supports our kademlia protocol. We should learn this briefly after the connection is established, assuming we and the remote support the identify protocol.

This will work great out of the box in the common case. For anything else, the user can still manually call bootstrap.

thomaseizinger avatar Nov 22 '23 22:11 thomaseizinger

we bootstrap on new connections until one succeeds then go to 2

How about "we bootstrap on new-node-in-kademlia-routing-table [...]"?

  1. This would ensure the constrained proposed by @thomaseizinger above, namely that we should only bootstrap with a node iff they support the Kademlia protocol. (Nodes are only added to the routing table automatically if they support the protocol.)
  2. On start-up of a node, some users call Behaviour::add_address without actually connecting to it (i.e. without Swarm::dial). Reacting on new-node-inkademlia-routing-table instead of new-connection would cover this use-case.

Otherwise, proposal above looks good to me.

mxinden avatar Nov 23 '23 08:11 mxinden

As @PanGan21 said, I think it makes sense to focus this PR on (1) & (2).

Yes, I totally agree. Part 3 can always be done later if needed, and in the meantime, end users can always catch the event QueryResult::Bootstrap and react accordingly calling the bootstrap function if they want.

What I wanted to add to this is, we should only start the bootstrap if the node supports our kademlia protocol.

Yes indeed. I forgot to mention it.

On start-up of a node, some users call Behaviour::add_address without actually connecting to it (i.e. without Swarm::dial). Reacting on new-node-inkademlia-routing-table instead of new-connection would cover this use-case.

I think we do this sometimes :joy: It might indeed be better to do as you propose @mxinden.

stormshield-frb avatar Nov 23 '23 09:11 stormshield-frb

2. On start-up of a node, some users call Behaviour::add_address without actually connecting to it (i.e. without Swarm::dial). Reacting on new-node-inkademlia-routing-table instead of new-connection would cover this use-case.

I get the idea. We "trust" those addresses being kademlia nodes, right? How would this interact with #4302? Would we still have an add_address function on kademlia that acts as a "I know what I am doing, just add this address"-kind of function? Or would all addresses we share across the behaviours be added to the table?

I think overall, reacting on "new node in routing table" is good so happy to move forward with this idea! I think my concern is more about the behaviour of add_address but that doesn't need to block this PR.

thomaseizinger avatar Nov 23 '23 21:11 thomaseizinger

From what I understand the condition new-node-inkademlia-routing-table is actually true when there is an Event::RoutingUpdated with is_new_peer: true. Correct me if I am wrong or missing something

PanGan21 avatar Nov 25 '23 10:11 PanGan21

From what I understand the condition new-node-inkademlia-routing-table is actually true when there is an Event::RoutingUpdated with is_new_peer: true. Correct me if I am wrong or missing something

That sounds correct! :)

thomaseizinger avatar Nov 25 '23 21:11 thomaseizinger

This pull request has merge conflicts. Could you please resolve them @PanGan21? 🙏

mergify[bot] avatar Nov 30 '23 07:11 mergify[bot]

This pull request has merge conflicts. Could you please resolve them @PanGan21? 🙏

mergify[bot] avatar Dec 01 '23 23:12 mergify[bot]

This pull request has merge conflicts. Could you please resolve them @PanGan21? 🙏

mergify[bot] avatar Dec 03 '23 22:12 mergify[bot]

LGTM, thank you very much! :)

Will give @mxinden a few days to review this too.

Still need to fix the failing tests 😅, so i will start with this now since the logic is good

PanGan21 avatar Dec 03 '23 23:12 PanGan21

LGTM, thank you very much! :) Will give @mxinden a few days to review this too.

Still need to fix the failing tests 😅, so i will start with this now since the logic is good

I am surprised there are any failing tests. We are just adding new code, not changing existing functionality.

thomaseizinger avatar Dec 04 '23 00:12 thomaseizinger

Since I did not know how to contribute to your work @PanGan21 without breaking everything or duplicating PR, I did a separate branch in my fork to showcase what I have in mind: https://github.com/stormshield-frb/rust-libp2p/pull/1/files

I tested this patch in our project and ran our integration tests with our manual bootstrap system disabled and this automatic bootstrap system enabled. All our tests are green. Of course, it does not mean that it is perfect because I don't think our tests cover 100% of every possible situations :joy:

Let me know what you think about it :blush: hope it helps.

stormshield-frb avatar Dec 04 '23 16:12 stormshield-frb

Since I did not know how to contribute to your work @PanGan21 without breaking everything or duplicating PR, I did a separate branch in my fork to showcase what I have in mind: https://github.com/stormshield-frb/rust-libp2p/pull/1/files

I tested this patch in our project and ran our integration tests with our manual bootstrap system disabled and this automatic bootstrap system enabled. All our tests are green. Of course, it does not mean that it is perfect because I don't think our tests cover 100% of every possible situations 😂

Let me know what you think about it 😊 hope it helps.

Hey @stormshield-frb , thanks a lot for helping and I appreciate your work! It looks much cleaner than mine to be honest. I am fine to close this pr and let you take it over of course if you want because it is your work! Otherwise I can follow what the maintainers suggest, or integrate your suggestions here.

PanGan21 avatar Dec 04 '23 18:12 PanGan21