
SOP for recovering from outage

Open orriin opened this issue 9 months ago • 0 comments

The saying "prevention is better than cure" applies strongly to chain outages: our first priority must always be to do everything possible to avoid them.

However, we are all human and make mistakes, and even the largest chains (Solana, Polkadot, Bitcoin) have at times experienced devastating outages that required developer intervention to get back online.

A chain outage, although unlikely, is a catastrophic event, so it is imperative that we are prepared for one and have an SOP ready to execute if we need to roll back the chain.

The SOP should define clear steps to be carried out by three actors

  • one from nucleus (responsible for node related tasks)
  • one from medula (responsible for OpenTensor infra related tasks)
  • one from leadership (responsible for community updates)

to swiftly restore chain operation, keep the community updated, and ultimately publish a post-mortem and ensure steps are put in place to prevent recurrence of the issue.

orriin, May 20 '24 08:05