
blue/green deployments with spread result in unbalanced workload


If filing a bug please include the following:

Nomad version

Nomad v0.12.3 (2db8abd9620dd41cb7bfe399551ba0f7824b3f61)

Operating system and Environment details

NAME="Red Hat Enterprise Linux Server"
VERSION="7.7 (Maipo)"
ID="rhel"
VARIANT="Server"
VERSION_ID="7.7"

Issue

When trying to achieve an ideal even spread of job allocations between two Consul DCs, we are specifying

     "Spreads": [
        {
          "Attribute": "${attr.consul.datacenter}",
          "Weight": 50,
          "SpreadTarget": null
        }
      ],

in all of our job definitions. There are two Consul datacenters with exactly the same number of workload servers, which should theoretically mean that we always end up with 50% of allocations in one datacenter and 50% in the other.
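For reference, that JSON is the API representation of the spread; in the HCL jobspec the equivalent stanza (at the job or group level) looks roughly like this:

    # Spread allocations evenly across Consul datacenters.
    # The weight of 50 mirrors the JSON above.
    spread {
      attribute = "${attr.consul.datacenter}"
      weight    = 50
    }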

If something goes wrong, for example an ESXi host underlying the VMs running Nomad dies and kills off a huge number of workload servers in one site, or any other major disruption causes Nomad to start allocating jobs unevenly across Consul datacenters (which is desirable during the outage event), it becomes impossible to rebalance the jobs by deploying a new blue/green declared job.

This appears to be because, as the documentation suggests, we are specifying the same number of canaries as the eventual desired count for each TaskGroup.
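Concretely, our update stanzas follow that documented blue/green pattern and look roughly like the following (the timing values here are illustrative rather than our exact settings):

    # Blue/green: deploy a full set of canaries equal to the group count,
    # then promote them manually once they report healthy.
    update {
      canary           = 4
      max_parallel     = 4
      min_healthy_time = "30s"  # illustrative
      healthy_deadline = "5m"   # illustrative
      auto_promote     = false  # promotion is done manually
      auto_revert      = false
    }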

So if we end up (in a simple example) with a job that starts out with 1 TaskGroup declaring a Count of 4 and the above Spread, we have seen the following sequence of events occur:

  1. A major disruption occurs in one site and jobs migrate to the available compute capacity in the still-healthy site. For example, the job we're following might end up with 4 allocs in one site and 0 allocs in the site which suffered the outage. (For the duration of the outage this is totally fine and actually desirable.)
  2. The disruption is fixed and compute capacity is restored. Everyone is happy. There are still 4 allocs running in one site and 0 allocs in the now-healthy site, so obviously something needs to be done to get the jobs back to "roughly evenly distributed"; otherwise a similar outage event hitting the site with all 4 allocs deployed would cause a total outage while the jobs are moved over to the healthy site again.
  3. To work around this, one would immediately assume that deploying the job as a regular blue/green deployment, as we have set up elsewhere, would result in the new allocations being distributed 2/2 across the datacenters.

However, what we have noticed actually occurs is that when we schedule the job with 4 canaries, to match the group's count of 4, all 4 canaries are spun up in the Consul datacenter with 0 allocs (which, for that specific moment in time, does technically result in a 50/50 split across the datacenters). Then, when the canaries come up healthy and are promoted, the 4 allocs that were there previously spin down, leaving us, once again, with 4 allocations in one datacenter and 0 allocations in the other.

The only way we have been able to work around this is by lowering the canary count on the job so that fewer canaries are deployed and we have a chance of the job coming up evenly.
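In other words, the workaround amounts to something like the following, with the canary count dropped below the group count so the scheduler has room to place the new allocations in the under-used datacenter (the exact number is just whatever we fiddle with for a given deploy):

    # Workaround: fewer canaries than the final count of 4, so promotion
    # does not simply replace 4 old allocs with 4 new ones placed in the
    # same datacenter.
    update {
      canary       = 2
      max_parallel = 2
    }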

This seems like it's possibly not the intent of the "Spread" functionality. I would have thought that the Spread you declare in your job describes the desired end state, after the canaries are promoted and the old allocations are spun down, rather than the spread at the moment the canaries are spun up, which becomes somewhat meaningless once those old allocations are removed.

It's entirely possible that I'm simply misunderstanding the Spread functionality and the documented approach for performing blue/green deployments. If so, let me know; a pointer to the docs I've likely misread would be much appreciated. It does strike me, though, that this behaviour is a little unexpected given what I've read about the Spread definition.

Reproduction steps

See above description of how the state was achieved.

hvindin · Oct 26 '20