document no-downtime upgrade path

Open ltalirz opened this issue 1 year ago • 0 comments

In what area(s)?

/area administration

/area ansible /area autoscaling /area configuration /area cyclecloud /area documentation /area image /area job-scheduling /area monitoring /area ood /area remote-visualization /area user-management

Describe the feature

AZ-HOP has been evolving rapidly over the past couple of months, and new features often require upgrades to existing clusters that will disrupt users' work if done "in-place".

An alternative upgrade route is to create a new az-hop cluster

  • that mounts the same ANF, and
  • whose vnet is peered with the old cluster

so that users can move from the old system to the new one gradually, without disruption to their work.
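One thing worth checking before deploying the second cluster: Azure VNet peering requires that the two virtual networks have non-overlapping address spaces. A minimal sketch of such a pre-check, using only the Python standard library, could look like the following. The CIDR ranges are hypothetical placeholders, not values taken from an az-hop configuration.

```python
import ipaddress


def overlapping(old_cidrs, new_cidrs):
    """Return every (old, new) pair of address ranges that overlap."""
    pairs = []
    for old in old_cidrs:
        for new in new_cidrs:
            if ipaddress.ip_network(old).overlaps(ipaddress.ip_network(new)):
                pairs.append((old, new))
    return pairs


# Hypothetical address spaces for the old and the new cluster vnet.
print(overlapping(["10.0.0.0/16"], ["10.0.128.0/17"]))  # [('10.0.0.0/16', '10.0.128.0/17')]
print(overlapping(["10.0.0.0/16"], ["10.1.0.0/16"]))    # []
```

An empty result means the two vnets can be peered as far as address ranges are concerned; any returned pair means the new cluster needs a different range.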

After the new deployment has passed all relevant tests, users can be asked to submit new jobs on the new system. Once all jobs on the old system have completed, it can be shut down and removed.
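The "all jobs on the old system have completed" condition could be checked with something like the sketch below. It assumes the old cluster runs Slurm and that `squeue` is available on the node where the script runs; the function names are made up for illustration. By default `squeue` lists only jobs that have not finished yet, so any line in its output is treated as a job still holding up the shutdown.

```python
import subprocess


def remaining_jobs(squeue_output):
    """Parse '<jobid> <state>' lines into a list of (jobid, state) tuples."""
    jobs = []
    for line in squeue_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            jobs.append((parts[0], parts[1]))
    return jobs


def old_cluster_is_drained():
    """Run squeue on the old cluster and report whether no jobs are left."""
    result = subprocess.run(
        ["squeue", "--noheader", "--format=%i %T"],
        check=True, capture_output=True, text=True,
    )
    return not remaining_jobs(result.stdout)


# Parsing demonstrated on hypothetical squeue output (no Slurm needed).
sample = "1041 RUNNING\n1042 PENDING\n"
print(remaining_jobs(sample))  # [('1041', 'RUNNING'), ('1042', 'PENDING')]
print(remaining_jobs(""))      # []
```

Job submission on the old cluster would need to be blocked first; otherwise an empty queue does not guarantee it stays empty.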

It would be great to have a detailed guide on how to perform upgrades in this way. There may be some further details to cover (e.g. you will likely also want to mirror or reuse the same Key Vault, and potentially the Slurm job database).

cc @matt-chan

ltalirz · Mar 05 '23 23:03