polkadot icon indicating copy to clipboard operation
polkadot copied to clipboard

Buffered connection management for collator-protocol

Open slumber opened this issue 3 years ago • 2 comments

Resolves https://github.com/paritytech/polkadot/issues/5062 Might help with #4617 Needed for #5054 (easily rebased on the feature branch thus targets master)

slumber avatar Sep 16 '22 16:09 slumber

Connected validators are updated on each new advertisement the same way as before, this way we keep the connection to the current group alive.

For async backing there's one thing to keep in mind: a collation doesn't leave the view immediately once the leaf gets deactivated. If a validator doesn't fetch a collation within some timeout, we should reset the corresponding bit ourselves and disconnect if there're no more collations to send

slumber avatar Sep 17 '22 07:09 slumber

Note that there's also https://github.com/paritytech/polkadot/issues/5062 which is not addressed

Cleaning up a connection too soon may make things worse because we work in a low latency environment and collator's time window for producing and distributing a candidate is tight

On the other hand, if e.g. the availability is slow and the core stays occupied, we will stay connected to a group needlessly, but in this case validators will simply time us out which I don't think much of a deal.

UPD: after discussion it was decided to cleanup connections on timeout

slumber avatar Sep 18 '22 07:09 slumber

We need one more review on this one, @tdimitrov ?

eskimor avatar Sep 28 '22 08:09 eskimor

@ordian unfortunately refactoring collator protocol right now is not the best idea, we will have a lot of changes with async backing. This code was initially intended for async backing only, and the hack is present because it's adapted for master branch. In the feature branch we will simply reconnect on every activation.

slumber avatar Sep 28 '22 11:09 slumber

Only thing missing is Versi burnin. Or did we have one already?

eskimor avatar Oct 03 '22 09:10 eskimor

Only thing missing is Versi burnin. Or did we have one already?

on it

slumber avatar Oct 03 '22 09:10 slumber

After a burn-in I discovered that 12-secs reconnect hack can cause an issue if collator doesn't see activations for some reason. Since advertisements interest is not updated, collator will continuously attempt to reconnect to previously advertised validators.

Another option discussed with @eskimor was to reconnect on every activation. This solves the above issue in a way that validator would simply time us out. Also there's a 6 seconds timeout for fetching a collation, basically making it the same logic that would be used for async backing. This is needed to keep bits consistent.

slumber avatar Oct 04 '22 15:10 slumber

After a burn-in I discovered that 12-secs reconnect hack can cause an issue if collator doesn't see activations for some reason.

In case of a finality stall, relay chain block times would be (intentionally) high up to several minutes. Is that what you've observed?

Another option discussed with @eskimor was to reconnect on every activation. This solves the above issue in a way that validator would simply time us out.

Is that the same thing suggested in https://github.com/paritytech/polkadot/issues/4435 which we implemented and ended-up reverting?

If we're trying to solve paritytech/polkadot#5062 here, that's not the way.

ordian avatar Oct 04 '22 15:10 ordian

In case of a finality stall, relay chain block times would be (intentionally) high up to several minutes. Is that what you've observed?

Probably yes.

Another option discussed with @eskimor was to reconnect on every activation. This solves the above issue in a way that validator would simply time us out.

Is that the same thing suggested in paritytech/polkadot#4435 which we implemented and ended-up reverting?

If we're trying to solve paritytech/polkadot#5062 here, that's not the way.

No, this PR introduces a state for tracking validators we want to be connected to, the question is when we want to connect.

Before the latest fix: Connect on advertisements + at least every 12 seconds

After the fix: Connect on advertisements + at each activation + timeout validators that don't fetch within 6 seconds

slumber avatar Oct 04 '22 15:10 slumber

I am not sure I understand the issue. What does continuously attempting to reconnect mean. Like at what level? Are we renewing connection requests constantly - if so where and how. Could you break down a bit more what you think is happening and why it is undesired?

eskimor avatar Oct 04 '22 15:10 eskimor

I am not sure I understand the issue. What does continuously attempting to reconnect mean. Like at what level? Are we renewing connection requests constantly - if so where and how. Could you break down a bit more what you think is happening and why it is undesired?

  1. We advertised collation to group, set their bits to 1 for this relay parent and connect.
  2. They don't fetch and finality stalls.
  3. Every 12 seconds we see that there're validators we want to be connected to according to the buffer bits and issue connection request to them (even though they've probably already timed us out)

slumber avatar Oct 04 '22 15:10 slumber

See 2ed7f7a54924e9485cd0b84fe63002e6927e6e8d for details

slumber avatar Oct 04 '22 15:10 slumber

Just to clarity: connect_to_validators doesn't issue a connection request per se. It adds validators to the reserved peerset. That means as long as we have them there and we're not connected to them, we'll try to connect (much more frequent than every 12 seconds). When we change the reserved set, e.g. set it to empty, we'll disconnect from the difference of old - new. If a validator is in our reversed, but he disconnects us, we'll still try to connect after disconnection.

ordian avatar Oct 04 '22 15:10 ordian

Just to clarity: connect_to_validators doesn't issue a connection request per se. It adds validators to the reserved peerset. That means as long as we have them there and we're not connected to them, we'll try to connect (much more frequent than every 12 seconds). When we change the reserved set, e.g. set it to empty, we'll disconnect from the difference of old - new. If a validator is in our reversed, but he disconnects us, we'll still try to connect after disconnection.

Well then this fix doesn't really solve finality stall issue, thank you for clarification. We could timeout validators if they don't fetch after 6 seconds and make sure we reconnect (update reserved peerset) at least once every 12 seconds.

OR

Disconnect from everyone if there're no advertisements within next 12 seconds. This doesn't update the buffer thus a little dirty

slumber avatar Oct 04 '22 15:10 slumber

Uuh got it. Refreshing connections all the time on finality stall does not make too much sense indeed. So let's use leaf activation as refresh reason then.

eskimor avatar Oct 04 '22 15:10 eskimor

Uuh got it. Refreshing connections all the time on finality stall does not make too much sense indeed. So let's use leaf activation as refresh reason then.

this would still keep validators in reserved peerset as @ordian noted.

slumber avatar Oct 04 '22 15:10 slumber

Ok, when revisiting, I remember what I did not understand. Why is 3 happening? If we timeout after 12s the validator interest, we should disconnect - no?

eskimor avatar Oct 04 '22 16:10 eskimor

Ok, when revisiting, I remember what I did not understand. Why is 3 happening? If we timeout after 12s the validator interest, we should disconnect - no?

Previous approach is to connect at least once every 12 seconds based on the buffer without resetting the bits.

slumber avatar Oct 04 '22 16:10 slumber

As a simple safeguard for master branch I added a logic to remove all advertisements from the buffer on timeout.

This is correct because usually the advertisement goes out of view, if it doesn't and the timeout is hit (we had no new collations) there's no point in being connected to anyone, validators had plenty of time to fetch.

This won't work for async backing, collator can keep producing collations one after another and this timeout is never reached. Needs something smarter.

slumber avatar Oct 04 '22 18:10 slumber