Collator Protocol: Connection Management

Open eskimor opened this issue 3 years ago • 9 comments

With a pre-validation function, we will be able to accept collations only from collators who can prove that they are in fact a valid block author for the next parachain block, as is possible with, for example, AURA-based parachains. This already greatly improves the reliability of the protocol, but checking a pre-validation function for every connection will still be a relatively expensive operation, especially when facing lots of connection requests, where we need to decide quickly which ones could be valid and which ones are definitely not (random nodes, not known collators).

So while a pre-validation function can protect us from misbehaving collators, I think the protocol would benefit from an additional mechanism, protecting us from misbehaving random nodes: We should have another function providing a list of CollatorIds - ideally even with PeerIds of currently known collators. With such a proven list, we can very quickly decide whether we want to accept a connection or not and can either restrict incoming connections to known collators or at least prioritize them.
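A minimal sketch of how such a list could gate incoming connections, assuming the list has already been proven and is held in memory. `PeerId`, `Admission` and `CollatorAllowlist` are hypothetical names for illustration, not types from the actual collator protocol:

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the real network peer identifier.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct PeerId(pub String);

/// Outcome of the cheap admission check.
#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    /// Peer is on the proven list: accept right away.
    Accept,
    /// Unknown peer, lenient mode: serve only if capacity is left over.
    LowPriority,
    /// Unknown peer, strict mode: refuse the connection.
    Reject,
}

/// Hypothetical allowlist built from a proven collator list.
pub struct CollatorAllowlist {
    known_peers: HashSet<PeerId>,
    /// If true, unknown peers are rejected instead of deprioritized.
    strict: bool,
}

impl CollatorAllowlist {
    pub fn new(known: impl IntoIterator<Item = PeerId>, strict: bool) -> Self {
        Self { known_peers: known.into_iter().collect(), strict }
    }

    /// A single hash-set lookup; no parachain-provided code is executed here.
    pub fn admit(&self, peer: &PeerId) -> Admission {
        if self.known_peers.contains(peer) {
            Admission::Accept
        } else if self.strict {
            Admission::Reject
        } else {
            Admission::LowPriority
        }
    }
}
```

The `strict` flag corresponds to the two options mentioned above: restricting connections to known collators, or merely prioritizing them.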

Any collator that is now known can then connect and provide an updated list. Thus, as soon as a single valid collator has been able to connect, non-collator nodes will have a tough time interrupting the service throughout the lifetime of the parachain, as connected collators will always be able to provide us with an updated list, which ensures connectivity for future generations/sessions.

Handover/Rotation

For this to work on validator group rotations, the previous group would need to inform the next group of the most current collator list for its parachain. So each validator will have to keep at least two collator lists: one for the next parachain it will rotate into and one for the current one. With contextual execution/asynchronous backing they might need to keep three, to account for parachains using a somewhat older relay parent. So the three lists we would need to keep are:

  • The most current parachain's list we rotated into with our most current leaves
  • The one right before that
  • And the next one we will be rotated into on the next rotation

The easiest scheme would be to allow connections for all of those parachains' collators. More complex schemes could have some phase-in/phase-out semantics. E.g. we can drop the previous list once we are enough blocks into the current one that we would not accept a block referencing those blocks anymore anyway. For the next one, we could begin fetching the most current list only during the last couple of blocks of the current rotation. Both are nice optimizations, but they might also be a good way to limit any potential effects a malicious parachain could have on neighboring parachains, if they get access to the same validators at the same time.
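As a rough sketch of the bookkeeping this implies, assuming each list is keyed by the parachain it belongs to. `ParaId` and `CollatorId` are simplified here, and `RotationLists` and its methods are hypothetical names, not existing code:

```rust
use std::collections::HashSet;

/// Simplified stand-ins for the real types (assumptions for this sketch).
pub type ParaId = u32;
pub type CollatorId = [u8; 32];

type Slot = Option<(ParaId, HashSet<CollatorId>)>;

/// The three lists a validator would keep: previous, current and next.
#[derive(Default)]
pub struct RotationLists {
    previous: Slot,
    current: Slot,
    next: Slot,
}

fn slot_contains(slot: &Slot, para: ParaId, collator: &CollatorId) -> bool {
    matches!(slot, Some((p, list)) if *p == para && list.contains(collator))
}

impl RotationLists {
    /// Store the list handed over by the group currently backing `para`.
    pub fn set_next(&mut self, para: ParaId, list: HashSet<CollatorId>) {
        self.next = Some((para, list));
    }

    /// On group rotation: current becomes previous, next becomes current.
    pub fn rotate(&mut self) {
        self.previous = self.current.take();
        self.current = self.next.take();
    }

    /// Phase out: forget the previous list once we are far enough into
    /// the current rotation.
    pub fn drop_previous(&mut self) {
        self.previous = None;
    }

    /// Easiest scheme: a collator is known if it is on any list we hold.
    pub fn is_known(&self, para: ParaId, collator: &CollatorId) -> bool {
        slot_contains(&self.previous, para, collator)
            || slot_contains(&self.current, para, collator)
            || slot_contains(&self.next, para, collator)
    }
}
```

The phase-in optimization would amount to calling `set_next` only near the end of the current rotation, and the phase-out to calling `drop_previous` after some number of blocks; when exactly to do either is left open here.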

Considerations

I think a system like this would greatly improve the collator - validator reliability for a parachain. By simply being able to reject any unknown nodes immediately and cheaply, we can have a very robust system. The downsides:

  • Does not work for e.g. Proof of Work parachains, but the situation would not be any worse for them than it is now. We should just ensure that providing such a proven list is optional.
  • Using such a list imposes some upper bound on the number of collators, although not a tight one: even a list of thousands of collators should not be too large to be handled properly. Keeping a few kilobytes in memory and transferring them every two minutes between a couple of nodes should not be a problem.
  • The system as proposed would not work for parathreads.

Variations

Instead of making a standalone function for getting that collator list, we could also make it an optional result of the normal PVF execution. Like an event that gets triggered whenever the list changes. We could even consider putting those lists on the relay chain as part of the validation result. That would take up some chain storage, but would solve the parathread issue and would also take care of communicating the lists to the next backing group.

Alternate proposal by @rphmeier

With asynchronous backing we can afford for the block producer and the transmitting node not to be the same. This way, collators can collaborate in order to deliver the block to the backing group.

Together with validators keeping track of a simple reputation of last-seen good nodes, collators can, even under attack, get the collation to one of those nodes, which will then take care of getting it to the validators.
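One very simple way a validator could track "last seen good nodes" is a small bounded set ordered by recency. This is only a sketch under that assumption; `GoodPeers`, `note_good` and `note_bad` are hypothetical names, and a real reputation system would likely be more nuanced:

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the real network peer identifier.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct PeerId(pub String);

/// Bounded record of peers that recently delivered a valid collation.
pub struct GoodPeers {
    capacity: usize,
    /// Oldest entry at the front, most recently good peer at the back.
    recent: VecDeque<PeerId>,
}

impl GoodPeers {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, recent: VecDeque::new() }
    }

    /// Record that `peer` delivered a collation that validated.
    pub fn note_good(&mut self, peer: PeerId) {
        // Remove any older entry so the peer moves to the back.
        self.recent.retain(|p| *p != peer);
        self.recent.push_back(peer);
        // Evict the least recently good peer when over capacity.
        if self.recent.len() > self.capacity {
            self.recent.pop_front();
        }
    }

    /// Forget a peer that misbehaved.
    pub fn note_bad(&mut self, peer: &PeerId) {
        self.recent.retain(|p| p != peer);
    }

    /// Peers in this set would keep their connection slot under load.
    pub fn is_preferred(&self, peer: &PeerId) -> bool {
        self.recent.contains(peer)
    }
}
```

Under attack, a validator would then favor connections from peers for which `is_preferred` returns true, so a collator only needs to reach one such peer.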

It would even be possible to completely split the role of parachain block producers and nodes getting that block to the backing group. E.g. having a number of nodes whose whole purpose is getting parachain blocks to the current backing group.

The details have to be figured out, but it definitely sounds like an interesting direction - especially together with PoV torrenting.

The semantics would become more gossip-like: as long as some good node is able to establish a connection, the parachain will stay live.

Group rotations will still be tricky though, if validators don't keep track of good peers across rotation boundaries.

eskimor avatar Dec 28 '21 23:12 eskimor

We'll de facto limit the number of collators with something like this, but imho that's fine.

The pre-validation function limits connections much more tightly, but yes winds up more complex to check.

burdges avatar Dec 29 '21 11:12 burdges

Comment by @rphmeier :

We should just have the pre-validation proof in the handshake of the collator or something

so yeah, let's see how cheap we can make these.

eskimor avatar Apr 01 '22 13:04 eskimor

It'll always be one Merkle proof into a distinguished part of the chain state that only changes once per epoch:

  • Aura & Babe require the collator prove their existence.
  • Sassafras requires the collator prove their slot assignment.

If you want to postpone doing that for Aura & Babe then maybe some reputation trick works short term for them, and we just implement this for sassafras only and make everyone use that, assuming it works.

burdges avatar Apr 01 '22 18:04 burdges

so yeah, let's see how cheap we can make these.

How should that work? We will need the parachain to provide the code that checks the pre-validation, and this could be anything. You cannot really "make it cheap" because you don't know what this code will look like.

bkchr avatar Apr 05 '22 19:04 bkchr

If we find a way to make it super cheap and a parachain chooses to use something more expensive, it should not be our problem. Like:

  • Document that this check has to be cheap
  • Provide some cheap default solution
  • Parachain chooses to use something expensive :shrug:

That being said, the generality alone makes me dubious that it can be cheap enough. Anyhow, we will need to do some real-world testing to actually know the requirements.

eskimor avatar Apr 06 '22 09:04 eskimor

it should not be our problem

If someone has an expensive check in there, it means that someone could use this to DoS relay chain validators? It would just require sending connection requests with some junk data that triggers the expensive check. Not sure we should ignore this :P

bkchr avatar Apr 06 '22 09:04 bkchr

Damn it, damn good point! :-)

Even if we time-limit the check, an attacker would still try to max out that limit and get disproportionately more time than honest nodes. So yeah, this can't easily work; it likely makes DoSing easier as opposed to harder.
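To put rough numbers on this, here is a small back-of-the-envelope helper. It assumes every attacker request burns the full time limit while every honest request costs a fixed smaller amount; the function name and all figures are made up for illustration:

```rust
/// Share of total checking time consumed by attacker requests, in percent
/// (rounded down). Assumes each attacker request uses the full `limit_us`
/// and each honest request costs `honest_cost_us`.
pub fn attacker_share_percent(
    attacker_reqs: u64,
    honest_reqs: u64,
    limit_us: u64,
    honest_cost_us: u64,
) -> u64 {
    let attacker_time = attacker_reqs * limit_us;
    let honest_time = honest_reqs * honest_cost_us;
    let total = attacker_time + honest_time;
    if total == 0 {
        return 0;
    }
    attacker_time * 100 / total
}
```

For example, with a 10 ms limit, honest checks costing 1 ms, and equal request counts on both sides, the attacker's requests account for 10/11 of the time spent checking, i.e. about 90%.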

eskimor avatar Apr 06 '22 09:04 eskimor

Related: https://github.com/paritytech/polkadot/issues/1348

eskimor avatar Apr 06 '22 10:04 eskimor

it should not be our problem

If someone has an expensive check in there, it means that someone could use this to DoS relay chain validators? It would just require sending connection requests with some junk data that triggers the expensive check. Not sure we should ignore this :P

All relay chain validators, or just the ones on this core? If it's the latter, then it is mostly just the problem of the parachain in question.

AlistairStewart avatar Aug 10 '22 13:08 AlistairStewart

I'd think these wind up being bespoke for the block production methods we provide: at least sassafras, babe, and aura should all be Merkle proofs into specific parachain state, plus a couple of Schnorr-like signature checks.

We could however make sassafras, babe, and aura code modular in the right ways so that parachains could tweak them without tweaking the relay chain imposed pre-validation function. It's all delicate of course..

burdges avatar Sep 28 '22 17:09 burdges

Closing as superseded.

eskimor avatar May 24 '23 09:05 eskimor