Orchestrator rolling updates with job definition

Open jakubno opened this issue 8 months ago • 5 comments

Description

  • This will allow us to update the orchestrator job definition without causing downtime
  • It also improves observability of how many orchestrators are running the current version

How it works

We generate a new unique ID whenever the job definition, the secret, or the orchestrator binary changes. We use this ID to generate a new job with the new job definition. We also save the ID to Nomad as a variable, and there's a prestart check that compares the job's ID with the latest ID (the one saved in Nomad). If they don't match, the orchestrator is not started.

This means there will be multiple orchestrator jobs, but a new orchestrator will start only for the latest job.
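
The mechanism described above could be sketched roughly as follows. This is only an illustration, not the actual implementation: the description does not say how the ID is generated, so deriving it from a SHA-256 hash of the three inputs is an assumption, and the names `version_id`, `job_name`, `prestart_allows_start` and the `orchestrator-<id>` naming scheme are hypothetical.

```python
import hashlib


def version_id(job_definition: str, secret: str, binary: bytes) -> str:
    """Return a short ID that changes when any of the three inputs changes.

    Assumption: a content hash is used. The issue only says the ID is
    regenerated on change; it could just as well be a random value.
    """
    h = hashlib.sha256()
    for part in (job_definition.encode(), secret.encode(), binary):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()[:12]


def job_name(vid: str) -> str:
    """Hypothetical naming scheme: one job per version ID."""
    return f"orchestrator-{vid}"


def prestart_allows_start(job_id: str, latest_id: str) -> bool:
    """Prestart check: only the job carrying the latest ID may start."""
    return job_id == latest_id


old = version_id("job-v1", "secret", b"binary")
new = version_id("job-v2", "secret", b"binary")
latest = new  # stands in for the value saved as a Nomad variable

print(job_name(old), prestart_allows_start(old, latest))
print(job_name(new), prestart_allows_start(new, latest))
```

In this sketch both jobs exist side by side, but only the one whose ID matches the stored value passes the check.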

jakubno avatar Apr 11 '25 11:04 jakubno

I think the migration could be the following:

  1. Adjust the priority of the new job so it is evaluated before the old orchestrator job; this will block the old orchestrator job from being deployed to new nodes
  2. Deploy the new job once and roll all orchestrators
  3. Remove the old orchestrator job

The only question left is how to delete the old jobs that are unused.
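
One possible way to approach that question, sketched under assumptions that are not confirmed anywhere in this thread: treat a job as unused when it is not the latest one and has no running allocations. The `JobInfo` type, the `stale_jobs` helper, and the `orchestrator-<id>` naming are hypothetical, and the job list would have to come from Nomad in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobInfo:
    name: str            # hypothetical naming scheme: "orchestrator-<id>"
    running_allocs: int  # number of allocations still running for the job


def stale_jobs(jobs, latest_id, prefix="orchestrator-"):
    """Return names of orchestrator jobs that are safe to delete.

    A job qualifies only if it is an orchestrator job, is not the latest
    one, and has no running allocations left.
    """
    latest_name = prefix + latest_id
    return [
        j.name
        for j in jobs
        if j.name.startswith(prefix)
        and j.name != latest_name
        and j.running_allocs == 0
    ]


jobs = [
    JobInfo("orchestrator-a", 0),  # old and drained: can be removed
    JobInfo("orchestrator-b", 2),  # old but still serving: keep for now
    JobInfo("orchestrator-c", 0),  # latest: always keep
    JobInfo("api", 0),             # unrelated job: never touched
]
print(stale_jobs(jobs, "c"))
```

Keeping old jobs until their allocations drain means a cleanup pass like this would never stop a running orchestrator.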

ValentaTomas avatar Apr 12 '25 09:04 ValentaTomas

@jakubno Can't the priority on the new job ensure that it is evaluated before the currently running orchestrator job? I'm wondering whether we even need the wait at all.

ValentaTomas avatar Apr 22 '25 11:04 ValentaTomas

> The only question left is how to delete the old jobs that are unused.

Also we need to solve this before merging.

ValentaTomas avatar Apr 22 '25 13:04 ValentaTomas

> The only question left is how to delete the old jobs that are unused.
>
> Also we need to solve this before merging.

There won't be that many of them; deploying a new orchestrator version is currently a rather slow process.

jakubno avatar May 07 '25 16:05 jakubno

Waiting for #647 to be deployed.

jakubno avatar May 15 '25 12:05 jakubno