
Add exponential backoff config to AURA

Open · sam0x17 opened this issue 1 month ago · 0 comments

Right now, major migrations in subtensor are perilous. Under AURA consensus, validators take turns in a round-robin rotation, and the validator whose turn it is tries to complete the migration. If it fails to do so within the 12-second time limit, the next validator is selected and the process continues. For migrations like the recent 1.0 upgrade, where the migration itself almost always takes longer than 12 seconds, this causes huge delays. The validators end up partially completing pieces of the migration, gossiping those blocks to each other, and eventually cobbling together a complete version of the migration before finalization can continue, which can take hours.

Instead, it would be much better if the 12-second time limit were increased by a fixed scaling factor (e.g. 1.2x) after each successive failed round, so that the limit eventually becomes long enough to complete any migration.
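To make the proposal concrete, here is a worked sketch assuming a base limit $T_0 = 12$ s, a scaling factor $r = 1.2$, and a hypothetical migration that needs $D = 60$ s. The 60 s figure is purely illustrative, not a measured value. After $n$ consecutive failed rounds the limit would be $T_n$, and the first round long enough for the migration is $n^{*}$:

```math
T_n = T_0 \, r^{n}, \qquad
n^{*} = \left\lceil \frac{\ln(D / T_0)}{\ln r} \right\rceil
      = \left\lceil \frac{\ln 5}{\ln 1.2} \right\rceil
      = \lceil 8.83 \rceil = 9, \qquad
T_9 = 12 \cdot 1.2^{9} \approx 61.9\ \text{s}
```

Because the limit grows geometrically, the number of failed rounds grows only logarithmically with the migration's duration. Assuming each failed round consumes its full limit, the nine failed attempts in this example cost $T_0 (r^{9} - 1)/(r - 1) \approx 250$ s in total, i.e. minutes rather than hours.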

Presumably AURA already has a backoff setting that may or may not do what I describe above and simply needs to be enabled. If it does, we should definitely turn it on.

AC:

  • [ ] find out whether AURA's backoff setting does what we want
  • [ ] if it does, turn it on; if not, implement something along these lines, where successive round-robin failures produce progressively longer time limits according to a fixed scaling factor.
  • [ ] profit?
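The round-robin-with-backoff behaviour described in the AC can be sketched in Rust, the language subtensor is written in. This is a hypothetical model, not AURA's actual implementation or an existing config option. `BASE_LIMIT`, `SCALING_FACTOR`, `time_limit` and `simulate` are illustrative names, and the model assumes that an attempt succeeds exactly when the migration fits within the current round's limit.

```rust
use std::time::Duration;

// Hypothetical parameters for illustration only; NOT existing AURA/subtensor settings.
const BASE_LIMIT: Duration = Duration::from_secs(12);
const SCALING_FACTOR: f64 = 1.2;

/// Time limit for an attempt after `failed_rounds` consecutive failures:
/// BASE_LIMIT * SCALING_FACTOR^failed_rounds.
fn time_limit(failed_rounds: u32) -> Duration {
    BASE_LIMIT.mul_f64(SCALING_FACTOR.powi(failed_rounds as i32))
}

/// Simulates round-robin authorship: each round, the next validator in order
/// gets one attempt at a migration that needs `migration_time` to complete.
/// Returns (index of the validator that succeeds, number of failed rounds first).
fn simulate(validators: &[&str], migration_time: Duration) -> (usize, u32) {
    assert!(!validators.is_empty(), "need at least one validator");
    let mut failed_rounds = 0u32;
    loop {
        // Round-robin: rotate through validators in a fixed order.
        let author = failed_rounds as usize % validators.len();
        if migration_time <= time_limit(failed_rounds) {
            return (author, failed_rounds);
        }
        // This round's author ran out of time; the next one gets a longer limit.
        failed_rounds += 1;
    }
}

fn main() {
    let validators = ["alice", "bob", "charlie", "dave"];
    // Illustrative migration duration, not a measured figure.
    let migration_time = Duration::from_secs(60);
    let (author, failed) = simulate(&validators, migration_time);
    println!(
        "{} completes the migration after {} failed rounds (limit {:.1?})",
        validators[author],
        failed,
        time_limit(failed)
    );
}
```

Because `SCALING_FACTOR` is greater than 1, the loop always terminates for any finite migration time. A production version would likely also need an upper cap on the limit, which is omitted here for brevity.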

sam0x17 · May 23 '24 07:05