
Auto-promote for canaries changes to manual promotion when a canary can't initially get placed in the cluster


Nomad version

Nomad v1.7.6
BuildDate 2024-03-12T07:27:36Z
Revision 594fedbfbc4f0e532b65e8a69b28ff9403eb822e

Operating system and Environment details

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

Issue

We have had frequent issues recently where our nomad job deployments end up with their canaries stuck in a "Manual promotion" state, even though we have canary auto-promotion turned on. This behavior became noticeably more frequent after upgrading to Nomad v1.7.

Upon investigating, we noticed that canaries were only getting stuck in the manual promotion state when the canary could not be placed in the cluster immediately due to resource constraints and had to wait for a new nomad agent to come up to provide extra capacity. In that case the eval status would show "True" for "Placement Failures". The canary would eventually be deployed to a new nomad agent after a while and turn healthy, but the promotion strategy for the entire deployment would switch from auto-promote to manual promotion.

[screenshot: Screen Shot 2024-04-05 at 4 36 16 PM]
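For reference, the placement failure can also be confirmed from the CLI; the job name and eval ID below are placeholders:

# "Placement Failure" shows up in the job status output, and the specific
# evaluation can then be inspected directly for the per-group failure reasons.
$ nomad job status <job-name>
$ nomad eval status <eval-id>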

We were able to reproduce this behavior by adding a placement constraint with a bogus meta label to a job, so that the job had a placement failure when the deployment started, and sure enough it put the auto-promote-enabled canary into a "manual promotion" state. Here are some screenshots:

[screenshots: Screen Shot 2024-04-05 at 4 29 24 PM, 4 25 46 PM, 4 30 33 PM, 4 19 56 PM]
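A constraint of roughly this shape is enough to force the initial placement failure; the meta key and value here are made up for illustration:

# Hypothetical constraint on a node meta key that no client sets, so the
# canary cannot be placed until a matching node appears.
constraint {
  attribute = "${meta.bogus_label}"
  value     = "does-not-exist"
}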

This is causing a lot of ongoing disruption to our services, primarily because we have autoscaling turned on for many of our nomad jobs. This issue effectively disables scaling activities for a job once it gets stuck in such a state. That can be especially damaging when it happens while a job is at the low end of its scaling range, since the job can't scale up and add new allocs to meet increasing demand.

We shouldn't have to promote canaries manually whenever a canary briefly runs out of resource capacity while it waits for the cluster to scale out. Could someone on the team take a look at this problem and, if they can reproduce it, release a quick patch? This behavior seems counterproductive to a CD/autoscaling infrastructure strategy.

Reproduction steps

See above

Expected Result

The canary should be auto-promoted once it is eventually deployed successfully to a nomad agent.

Actual Result

The canary gets stuck in a "Manual Promotion" state, even though auto-promotion has been enabled in the nomad job spec.

jvineet avatar Apr 05 '24 23:04 jvineet

Hi @jvineet! I wasn't quite able to reproduce what you're seeing. So let me walk you through my reproduction and maybe there's something I'm missing here...

I'm using the following jobspec. Note the constraint on node metadata.

jobspec
job "example" {

  update {
    max_parallel     = 1
    min_healthy_time = "30s"
    auto_promote     = true
    auto_revert      = false
    canary           = 1
  }

  group "web" {

    count = 3

    constraint {
      attribute = "${meta.example}"
      operator  = "="
      value     = "foo"
    }

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}

I ensure the node has the correct metadata, and then deploy it so there's a version 0 to canary from:

$ nomad node meta apply example=foo

$ nomad job run ./example.nomad.hcl
==> 2024-04-08T16:31:07-04:00: Monitoring evaluation "937f5c99"
    2024-04-08T16:31:07-04:00: Evaluation triggered by job "example"
    2024-04-08T16:31:08-04:00: Evaluation within deployment: "bcca1de9"
    2024-04-08T16:31:08-04:00: Allocation "765877fc" created: node "008f7f36", group "web"
    2024-04-08T16:31:08-04:00: Allocation "a68710f9" created: node "008f7f36", group "web"
    2024-04-08T16:31:08-04:00: Allocation "d82ff360" created: node "008f7f36", group "web"
...
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    web         3        3       3        0          2024-04-08T16:41:38-04:00

Then I changed the node metadata so the canary for the next deploy can't be placed:

$ nomad node meta apply example=bar

I updated the job's resources.cpu to 101 to force allocation replacements, re-ran the job, and got what I'd expect:
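That is, the only change to the jobspec is the cpu value in the resources block:

resources {
  cpu    = 101 # bumped from 100 so the update is destructive
  memory = 100
}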

$ nomad job run ./example.nomad.hcl
==> 2024-04-08T16:33:15-04:00: Monitoring evaluation "da05d13a"
    2024-04-08T16:33:15-04:00: Evaluation triggered by job "example"
    2024-04-08T16:33:16-04:00: Evaluation within deployment: "2550144d"
    2024-04-08T16:33:16-04:00: Evaluation status changed: "pending" -> "complete"
==> 2024-04-08T16:33:16-04:00: Evaluation "da05d13a" finished with status "complete" but failed to place all allocations:
    2024-04-08T16:33:16-04:00: Task Group "web" (failed to place 1 allocation):
      * Constraint "${meta.example} = foo": 1 nodes excluded by filter
    2024-04-08T16:33:16-04:00: Evaluation "7d3d3562" waiting for additional capacity to place remainder
==> 2024-04-08T16:33:16-04:00: Monitoring deployment "2550144d"
  ⠸ Deployment "2550144d" in progress...

    2024-04-08T16:33:16-04:00
    ID          = 2550144d
    Job ID      = example
    Job Version = 1
    Status      = running
    Description = Deployment is running pending automatic promotion

    Deployed
    Task Group  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
    web         false     3        1         0       0        0          N/A^C

So the canary has failed to place. If I look at the web UI at this point, I do see something weird: the UI shows the yellow "Deployment requires promotion" box. This is weird because you can't promote a deployment that hasn't placed a canary yet. However, note from the output above that the deployment actually isn't waiting for manual promotion. So there may be a web UI bug here.
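The actual deployment state can be double-checked from the CLI rather than the UI (deployment ID taken from the run above):

# The CLI reports the deployment as running and "pending automatic promotion",
# not blocked on a manual promote, which is why this looks like a UI-only issue.
$ nomad deployment status 2550144d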

[screenshot: 2024-04-08 16-37-22]

Next I update the node metadata to allow the canary to be placed:

$ nomad node meta apply example=foo

And at that point, the deployment gets unstuck and continues normally, automatically promoting itself as expected. So while there definitely appears to be a web UI bug in how the placement issue with the canary is being reported, the deployment actually does automatically promote itself as we'd expect.

[screenshot: 2024-04-08 16-38-09]

$ nomad deployment status
ID        Job ID   Job Version  Status      Description
2550144d  example  1            successful  Deployment completed successfully
93b30da6  example  0            successful  Deployment completed successfully

@jvineet do you have a similarly minimal reproduction of the issue you're seeing? Can you confirm that the problem is that the deployment cannot continue at all, and that it's not just the UI reporting the wrong thing? (That's a bug too, but different in kind.)

tgross avatar Apr 08 '24 21:04 tgross

@tgross thank you for the quick response. This is definitely stalling our deployments indefinitely until we unblock them with some kind of manual intervention. So I can't imagine this is limited to a UI bug.

The reproduction example I listed above was only meant to show that an initial placement failure on the canary switches the deployment to "manual promotion", or something to that effect. I understand that in that example the canary can never be placed, so the deployment will naturally stall forever.

But what we are seeing in our production environment, which is very difficult to reduce to a fully reproducible example, is that when a canary couldn't be placed immediately and had to wait for the cluster to scale out, it still showed a "Placement Failure", which we could confirm in the eval status. That had the same effect: the nomad UI showed "Deployment requires manual promotion". A new alloc was ultimately placed when a new client was added, and yet the deployment never snapped out of that state. It continued showing "Deployment is running pending automatic promotion" in the deploy status and the manual promotion message in the UI.

We use nomad-autoscaler for scaling our jobs, and the nomad autoscaler also suspends all scaling activities for that job when this happens. Nomad-autoscaler logs report something akin to "the job is undergoing an active deployment so skipping it for now".

Is it possible that there is some latent timeout on how soon a canary needs to be placed? It can take up to 5 minutes for a new client to come up, so the canary might sit with a "Placement Failure" for some time before it finds enough capacity to be placed on a node.
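For reference, the only deployment-level timeout I'm aware of is the update block's progress_deadline, which defaults to 10 minutes; whether it interacts with canary auto-promotion here is unclear, but this is where it would be set:

update {
  canary            = 1
  auto_promote      = true
  progress_deadline = "10m" # default; the deployment is marked failed if no alloc becomes healthy within this window
}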

jvineet avatar Apr 08 '24 21:04 jvineet

Ok @jvineet can you provide an example jobspec (even if redacted in terms of names of tasks)? There may be some interaction between reschedule blocks, update blocks, etc. that's unforeseen.

tgross avatar Apr 09 '24 11:04 tgross

We also noticed something similar after updating one of our clusters to Nomad 1.7.5. Some jobs with auto_promote = true would get stuck for several minutes before being auto-promoted, and would show the "requires manual promotion" message in the web GUI.

We didn't really look into what caused it, but we're positive that it wasn't caused by unfulfilled placement constraints in our case.

Our issues went away after we triggered a leader election, causing a new node to become leader.

EDIT: Please note that this issue only presented itself for one out of three clusters that we updated to 1.7.5.

ostkrok avatar Apr 15 '24 12:04 ostkrok

@tgross Thank you for your patience. We actually did a deeper investigation on our end last week and found that this was ultimately happening due to falco tainting our linux kernel. That was causing issues with the docker daemon when it tried to launch containers, especially around the time a new EC2 instance would boot up.

We were able to replicate these conditions in our test environment by launching allocs that would require a scale-out event in our nomad fleet before they could be placed. We noticed that when a new alloc was placed on the newly scaled-out nomad client, the containers would often not start successfully and would require multiple restarts/reschedules. The Nomad task page would show multiple occurrences of "Exit Code 137, Exit Message: Docker container exited with non-zero exit code: 137".


We were sure these were not OOM-kill events, and we weren't seeing any startup logs from the services either. We first confirmed that the containers were in fact struggling to come up, so as to rule out Nomad entirely. Sure enough, we saw that the docker daemon would try to launch containers for these allocs, which would then exit immediately without any explanation. This led us to focus our investigation outside of nomad and on the docker daemon.

We didn't get anything useful from the docker daemon logs, but we noticed that the kernel logs recorded a gunicorn and a python process failure every time our containers died (our allocs run a gunicorn server and a sidecar running a python daemon).

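For anyone retracing this on systemd-based hosts, these are roughly the two log sources we mean (the time window is illustrative):

# Docker daemon logs and the kernel ring buffer, respectively
$ journalctl -u docker --since "1 hour ago"
$ journalctl -k --since "1 hour ago"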

We dug deeper into the kernel logs on those machines and noticed that our kernel was getting tainted by a falco kernel module on instance boot.


At first we weren't sure what this all meant exactly, but this doc on "tainted kernels" helped us make some sense of it.
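A quick way to check this on a given host, assuming a standard Linux setup (the second path only exists while the module is loaded):

# Non-zero output means the kernel is tainted; the per-module taint flag for
# falco can also be read from sysfs.
$ cat /proc/sys/kernel/tainted
$ cat /sys/module/falco/taint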

We searched and came across other logs that looked like

BUG: kernel NULL pointer dereference, address: 0000000000000000
...
RIP: 0010:ppm_is_upper_layer+0x4f/0x80 [falco]

and were occurring in lockstep with the process failure logs above.

This led us to test for a change in behavior after removing falco from our base AMI images, and as soon as we launched the new AMIs, all of our issues seemingly went away. We let this run over the weekend and can confirm that nomad logs that look like

Not Restarting ... Exceeded allowed attempts ... failed=true

have also largely gone away since we deployed our AMIs without falco, indicating that things are now looking more stable overall in our nomad cluster.


The only thing I'd like to add here that could be relevant to nomad is that we sometimes noticed that a deployment triggered by a simple count change in allocs (like a scaling event from nomad-autoscaler) would deploy a canary. We thought this was odd, since a simple count change should not trigger canaries, as that is a non-destructive deployment. This is most likely why some of our deploys were getting stuck in a manual canary state: when the initial canary failed to come up successfully on an auto-scaled client (due to the tainted kernel problem above), nomad was unable to find a healthy version to revert to, because the only changes across all previous versions were alloc count changes.

But at this point I can't be sure if this behavior of "canary on alloc count change" itself was somehow not another artifact of our tainted kernel. Regardless, I thought I'd add this info here in case someone else happens to come across it in the future.

jvineet avatar Apr 15 '24 22:04 jvineet

The only thing I'd like to add here that could be relevant to nomad is that we sometimes noticed that a deployment triggered by a simple count change in allocs (like a scaling event from nomad-autoscaler) would deploy a canary. We thought this was odd, since a simple count change should not trigger canaries, as that is a non-destructive deployment.

Good to point it out, but that is expected. The way that "scaling" works in Nomad is to update the job spec, which ends up kicking off a deployment even if the task groups are otherwise the same. The existing allocations are updated non-destructively (i.e. their job version is bumped, but they are otherwise untouched).
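For example, a scale-out via the CLI registers a new job version and creates a deployment even though the task group definition is otherwise unchanged (the count value here is illustrative):

# Scaling bumps the job version and kicks off a deployment; the existing
# allocations are updated in place rather than replaced.
$ nomad job scale example web 4
$ nomad job history example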

It looks like you've resolved the issue here. If you figure out why Falco broke the application I'd love to hear more just out of morbid curiosity. :grinning: But it looks like we can close this issue out.

@ostkrok if you're still encountering a problem, do me a favor and open a new issue with more details so we can debug that.

tgross avatar Apr 16 '24 12:04 tgross

We've got a fix for the UI issue in https://github.com/hashicorp/nomad/pull/20408 as well.

tgross avatar Apr 16 '24 14:04 tgross

Good to point it out, but that is expected.

@tgross, just out of curiosity, what would be an instance or a condition where a scaling action could result in a canary deployment?

A different engineer is going to look into the falco issue, but I can update here when and how it is ultimately resolved. I think our first approach is going to be upgrading falco to the latest version to see if that automagically resolves the problem.

jvineet avatar Apr 18 '24 19:04 jvineet