
canary-auto-promote not working correctly

Open vvitayau opened this issue 5 years ago • 2 comments

Levant's canary-auto-promote flag does not appear to be working. The canary deploys successfully, but the deployment is still running and requires manual promotion. Are the default values in my job's server group update stanza preventing the canary from auto-promoting?

$ levant deploy -canary-auto-promote=15 nomad/nginx.nomad
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value dc1
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value 1
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value true
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value 1
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value 1m
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value 1
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value 15s
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value 0
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value 2m
2018-10-29T19:16:38Z |INFO| template/funcs: using Consul KV variable with key service/nginx/image and value nginx:latest
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value 20
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value 10
2018-10-29T19:16:38Z |INFO| template/funcs: using default Consul KV variable with value 1
2018-10-29T19:16:38Z |INFO| levant/plan: group server and task nginx plan indicates change of Config:image from nginx:stable-alpine to nginx:latest
2018-10-29T19:16:38Z |INFO| levant/deploy: using dynamic count 1 for group server job_id=nginx
2018-10-29T19:16:38Z |INFO| levant/deploy: triggering a deployment job_id=nginx
2018-10-29T19:16:38Z |INFO| levant/deploy: evaluation 75fa79dc-97c9-aba0-ec0c-4b3d951886fd finished successfully job_id=nginx
2018-10-29T19:16:38Z |INFO| levant/deploy: job is not configured with update stanza, consider adding to use deployments job_id=nginx
2018-10-29T19:16:38Z |INFO| levant/job_status_checker: job has status running job_id=nginx
2018-10-29T19:16:38Z |INFO| levant/job_status_checker: all allocations in deployment of job are running job_id=nginx
2018-10-29T19:16:38Z |INFO| levant/deploy: job deployment successful job_id=nginx

$ nomad status nginx
...
Latest Deployment
ID          = 6785aa24
Status      = running
Description = Deployment is running but requires promotion

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy
server      true         false     1        1         1       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
e9e9e3ae  03ef581e  server      2        run      running  14m38s ago  14m12s ago
75924020  03ef581e  server      1        run      running  1h8m ago    1h8m ago
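
For reference, the manual promotion that auto-promote should be performing can be done with the standard Nomad CLI, using the deployment ID from the status output above:

$ nomad deployment promote 6785aa24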

Relevant Nomad job specification file:

job "nginx" {
  region      = "global"
  datacenters = [ "[[ consulKeyOrDefault "service/nginx/datacenters" "dc1" ]]" ]
  type        = "service"

  group "server" {
    count = [[ consulKeyOrDefault "service/nginx/server/count" "1" ]]

    update {
      auto_revert       = [[ consulKeyOrDefault "service/nginx/server/update/auto_revert" "true" ]]
      canary            = [[ consulKeyOrDefault "service/nginx/server/update/canary" "1" ]]

      # deadline for an allocation to become healthy before it is automatically transitioned to unhealthy
      healthy_deadline  = "[[ consulKeyOrDefault "service/nginx/server/update/healthy_deadline" "1m" ]]"

      max_parallel      = [[ consulKeyOrDefault "service/nginx/server/update/max_parallel" "1" ]]

      # time an allocation must spend in the healthy state before it is marked healthy
      min_healthy_time  = "[[ consulKeyOrDefault "service/nginx/server/update/min_healthy_time" "15s" ]]"

      # a value of 0 means the first allocation marked unhealthy causes the deployment to fail
      progress_deadline = "[[ consulKeyOrDefault "service/nginx/server/update/progress_deadline" "0" ]]"

      # delay between migrating allocations off nodes marked for draining
      stagger = "[[ consulKeyOrDefault "service/nginx/update/server/update/stagger" "2m" ]]"
    }

    task "nginx" {
      driver = "docker"
      config {
        image       = "[[ consulKeyOrDefault "service/nginx/image" "nginx:stable-alpine" ]]"
        dns_servers = ["169.254.1.1"]
        port_map = {
          http = 80
        }
      }

      service {
        port = "http"
        name = "canary-nginx"
        canary_tags = [
          "traefik.enable=true",
        ]
      }

      service {
        port = "http"
        name = "nginx"
        tags = [
          "traefik.enable=true",
        ]
      }

      resources {
        cpu    = [[ consulKeyOrDefault "service/nginx/resources/cpu" "20" ]]
        memory = [[ consulKeyOrDefault "service/nginx/resources/memory" "10" ]]
        network {
          mbits = [[ consulKeyOrDefault "service/nginx/resources/network/mbits" "1" ]]
          port "http" {
          }
        }
      }
    }
  }
}

Output of levant version:

Levant v0.2.5
Date: 2018-10-25T13:24:11Z
Commit: 0514741514e70caf82976c2c67f98414046b2392
Branch: 0.2.5
State: 0.2.5
Summary: 0514741514e70caf82976c2c67f98414046b2392

Output of consul version:

Consul v1.4.0-rc1 (1757fbc0a)

Output of nomad version:

Nomad v0.8.6 (ab54ebcfcde062e9482558b7c052702d4cb8aa1b+CHANGES)

vvitayau · Oct 29 '18 19:10

Thanks for the detailed report @vvitayau. I'll take a look into this when I can, hopefully soon, and get back to you.

jrasell · Oct 29 '18 19:10

We are experiencing this as well. It appears to happen when the update stanza is specified at the group level rather than at the job level (which is valid Nomad configuration). This seems to be the offending line of code: https://github.com/jrasell/levant/blob/0514741514e70caf82976c2c67f98414046b2392/levant/deploy.go#L151

The workaround is to move the update stanza to the job level (sketched below), which is acceptable when the different groups all need similar update configurations.
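
For illustration, a minimal sketch of that workaround, with the update stanza hoisted to the job level. Values are hardcoded here for brevity; in practice they would come from the same consulKeyOrDefault lookups as in the spec above:

job "nginx" {
  region      = "global"
  datacenters = ["dc1"]
  type        = "service"

  # update stanza at the job level, where Levant's auto-promote
  # check can see it; task groups inherit these settings
  update {
    auto_revert       = true
    canary            = 1
    healthy_deadline  = "1m"
    max_parallel      = 1
    min_healthy_time  = "15s"
    progress_deadline = "0"
    stagger           = "2m"
  }

  group "server" {
    # ... server group as in the spec above, minus its update stanza ...
  }
}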

margueritepd · Nov 08 '18 20:11