change_mode = "script" sometimes fails with Docker
Nomad version
Nomad v1.4.3 (f464aca721d222ae9c1f3df643b3c3aaa20e2da7)
Operating system and Environment details
AlmaLinux 8.7. Nomad installed manually from the pre-built binary. Using Docker CE 20.10.22.
Issue
For some of my tasks, I trigger a script when a template changes. Typically, I use this to trigger a custom reload action when a Vault-generated certificate gets renewed. Most of the time this works great, but sometimes the execution fails with this error:
Template failed to run script /local/bin/rotate-cert.sh with arguments [] because task driver doesn't support the exec operation
Reproduction steps
A job with a templated cert, a reload script, and change_mode = "script", for example:
template {
  data = <<-EOF
{{ with pkiCert "pki/db/issue/mongo" "common_name=mongo.service.consul" "ttl=5m" }}
{{ .Cert }}
{{ .Key }}
{{ end }}
EOF
  destination = "secrets/mongo.bundle.pem"
  change_mode = "script"
  change_script {
    command = "/local/bin/rotate-cert.sh"
  }
}
Using a short TTL like in this example makes the problem more visible, but it's still quite random
Expected Result
The /local/bin/rotate-cert.sh script is fired every time the cert is renewed.
Actual Result
Most of the time it's working, but sometimes (I couldn't find any common pattern) the script fails with:
Template failed to run script /local/bin/rotate-cert.sh with arguments [] because task driver doesn't support the exec operation
As a workaround, I've added the fail_on_error option:
change_script {
  command = "/local/bin/rotate-cert.sh"
  # If cert rotation fails, kill the task and restart it
  fail_on_error = true
}
So when this happens, the task is killed and restarted (so it gets its new cert).
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
Nothing interesting: we can see the template being rendered and the task being stopped (because of fail_on_error).
janv. 23 10:14:54 ct-wn-3 nomad[3202532]: 2023-01-23T10:14:54.138+0100 [INFO] agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/9b688daf-6269-eef6-1752-4f1db4413b3f/tracking-mongod/secrets/mongo.bundle.pem"
janv. 23 10:14:59 ct-wn-3 nomad[3202532]: 2023-01-23T10:14:59.342+0100 [INFO] client.driver_mgr.docker: stopped container: container_id=28a3bfc55b0d869f3abe3d2e67fa0ac2b26f1984133ee93da03d053fc110782d driver=docker
janv. 23 10:14:59 ct-wn-3 nomad[3202532]: 2023-01-23T10:14:59.378+0100 [INFO] agent: (runner) stopping
janv. 23 10:14:59 ct-wn-3 nomad[3202532]: 2023-01-23T10:14:59.378+0100 [INFO] agent: (runner) received finish
janv. 23 10:14:59 ct-wn-3 nomad[3202532]: 2023-01-23T10:14:59.577+0100 [INFO] client.driver_mgr.docker: stopped container: container_id=639b3b2291c13487a9c6e09e0dad620403938c9f1201cf33cc1f463f851e82af driver=docker
janv. 23 10:14:59 ct-wn-3 nomad[3202532]: 2023-01-23T10:14:59.579+0100 [INFO] client.driver_mgr.docker: stopped container: container_id=6103d61436006b17d96a1dc4a9f7f767e59eea9e6f782e0b8da5bbda2b344a81 driver=docker
janv. 23 10:14:59 ct-wn-3 nomad[3202532]: 2023-01-23T10:14:59.613+0100 [INFO] agent: (runner) stopping
janv. 23 10:14:59 ct-wn-3 nomad[3202532]: 2023-01-23T10:14:59.614+0100 [INFO] agent: (runner) received finish
janv. 23 10:14:59 ct-wn-3 nomad[3202532]: 2023-01-23T10:14:59.628+0100 [INFO] agent: (runner) stopping
janv. 23 10:14:59 ct-wn-3 nomad[3202532]: 2023-01-23T10:14:59.629+0100 [INFO] agent: (runner) received finish
Hi @dani 👋
Thanks for the bug report. Do you have any other template in this task that could cause the container to restart? Or maybe something in the rotate-cert.sh script could be causing something like that?
Nope, nothing specific in the script, which looks like this:
#!/bin/sh
mongosh --tlsAllowInvalidCertificates --eval <<_EOF
db.adminCommand( { rotateCertificates: 1, message: "Rotating certificates" } );
_EOF
And nothing else which could be triggered at the same time. The error message is quite surprising: if the script were simply failing, the message would be different (I don't remember the exact error message when the script fails, but it clearly states its exit code). In this case, it looks like the script isn't even fired ("because task driver doesn't support the exec operation").
The task restart is because of fail_on_error = true. The default behavior is for the change_script to fail (not be fired) and the task to stay active, with a non-working service since its certificate is now expired.
I was able to reproduce this with this job:
job "example" {
datacenters = ["dc1"]
group "client" {
task "curl" {
driver = "docker"
config {
image = "curlimages/curl:7.87.0"
command = "/bin/ash"
args = ["local/script.sh"]
}
template {
data = <<EOF
#!/usr/bin/env ash
while true; do
{{range nomadService "server"}}
curl http://{{.Address}}:{{.Port}}/
{{end}}
sleep 1
done
EOF
destination = "local/script.sh"
change_mode = "script"
change_script {
command = "/bin/ash"
args = ["local/change.sh"]
}
}
template {
data = <<EOF
#!/usr/bin/env ash
date
echo "change"
EOF
destination = "local/change.sh"
}
resources {
cpu = 10
memory = 50
}
}
}
group "server" {
network {
port "http" {}
}
service {
name = "server"
provider = "nomad"
port = "http"
}
task "http" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-v", "-f", "-p", "${NOMAD_PORT_http}", "-h", "/local"]
ports = ["http"]
}
template {
data = <<EOF
hello world
EOF
destination = "local/index.html"
}
resources {
cpu = 10
memory = 50
}
}
}
}
Changing the server->http task and re-running the job will trigger the error.
The problem is that if the Nomad or Vault token changes, the template hook will shut down the existing template manager and create a new one.
The new template manager will not have access to the task driver handle that was set in the Poststart hook, since these are in-place updates.
#15915 restores the driver handle when the template manager is recreated.
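To make the failure mode concrete, here is a heavily simplified, hypothetical Go sketch (the type and function names are made up for illustration and are not Nomad's actual internals): the template manager can only exec a change script through a driver handle attached from the task runner's Poststart hook, so a manager that gets rebuilt without carrying that handle over can only return the error reported above.

// Hypothetical, simplified sketch -- not Nomad source. It only illustrates why
// a recreated template manager cannot exec change scripts unless the driver
// handle is explicitly restored.
package main

import (
	"errors"
	"fmt"
)

var errNoExec = errors.New("task driver doesn't support the exec operation")

// DriverExec is a stand-in for the task driver's exec capability.
type DriverExec interface {
	Exec(cmd string, args []string) error
}

type templateManager struct {
	driverExec DriverExec // attached from the task runner's Poststart hook
}

func (m *templateManager) runChangeScript(cmd string, args []string) error {
	if m.driverExec == nil {
		return errNoExec // the error reported in this issue
	}
	return m.driverExec.Exec(cmd, args)
}

// rebuild models a Nomad/Vault token change: the old manager is shut down and
// a new one is created. Carrying the handle over is the missing piece.
func rebuild(old *templateManager) *templateManager {
	return &templateManager{driverExec: old.driverExec}
}

type noopExec struct{}

func (noopExec) Exec(cmd string, args []string) error { return nil }

func main() {
	m := &templateManager{}
	fmt.Println(m.runChangeScript("/local/bin/rotate-cert.sh", nil)) // handle not attached yet

	m.driverExec = noopExec{}                                        // attached in Poststart
	m = rebuild(m)                                                   // token change: handle preserved
	fmt.Println(m.runChangeScript("/local/bin/rotate-cert.sh", nil)) // <nil>
}

In this model, rebuild is the essence of what the linked fix describes: the new manager inherits the handle the old one already had.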
@dani do you have more than one task in your group or are you using Vault by any chance?
@dani do you have more than one task in your group or are you using Vault by any chance?
Yes to both: there are 3 tasks + one prestart, and we're using Vault (to get certificates).
Ah cool, so yeah I think the reproduction I had was the main cause behind the issue you're having. The next release of Nomad will have a fix for it.
Still experiencing this error in Nomad 1.5.6.
change_script runs every day. It can run successfully several times, but sooner or later I'm getting:
Template failed to run script /local/mariadb_ssl_reload.sh with arguments [] because task driver doesn't support the exec operation
Any suggestions?
Hi @kkornienko-aparavi 👋
Do you have any logs around when the problem happens? And would you be able to provide a sample of your job?
Hello @lgfa29, thanks for reacting. It looks like I found the root cause of the issue: it happens right after the Nomad client is restarted. The new process tries to re-render the templates and fails. I don't know whether that's even supposed to work or not. :)
I'm posting the important parts of the job below.
It's a Grafana deployment job, which consists of 2 tasks: one is MariaDB and the second is the Grafana app. We use Vault with short-lived certificates; we issue certificates and apply them to MariaDB. MariaDB requires a special command to be invoked to reload the certificates, which is why we use a change script.
job "grafana" {
dynamic "group" {
for_each = var.db_enabled ? [1] : []
labels = ["db"]
content {
count = 1
network {
port "mariadb" {
to = 3306
}
}
volume "nfsgrafana" {
type = "csi"
source = "nfsgrafana"
access_mode = "multi-node-multi-writer"
attachment_mode = "file-system"
}
task "db" {
driver = "docker"
vault {
env = false
policies = [...]
}
volume_mount {
volume = "nfsgrafana"
destination = "/bitnami/mariadb"
}
config {
image = "bitnami/mariadb"
ports = ["mariadb"]
}
service {
name = "grafana-db"
port = "mariadb"
check {
name = "tcp_validate"
type = "tcp"
port = "mariadb"
interval = "15s"
timeout = "5s"
}
}
env {
MARIADB_DATABASE = "grafana"
MARIADB_SKIP_TEST_DB = "yes"
MARIADB_EXTRA_FLAGS = join(" ", [
"--require_secure_transport=ON",
"--ssl_ca=/secrets/ca.crt",
"--ssl_cert=/secrets/cert.crt",
"--ssl_key=/secrets/cert.key",
"--tls_version=TLSv1.3",
])
}
template {
destination = "secrets/ca.crt"
change_mode = "restart"
data = <<EOH
{{ with secret "pki/cert/ca" }}{{.Data.certificate}}{{ end }}
EOH
}
template {
destination = "secrets/cert.crt"
change_mode = "noop"
data = <<EOH
{{- with $secret := secret "pki/issue/..." "common_name=..." "ttl=3d"}}
{{.Data.certificate}}
{{- end}}
EOH
}
template {
destination = "secrets/cert.key"
change_mode = "script"
change_script {
command = "/local/mariadb_ssl_reload.sh"
}
data = <<EOH
{{- with $secret := secret "pki/issue/..." "common_name=..." "ttl=3d"}}
{{.Data.private_key}}
{{- end}}
EOH
}
template {
data = file("grafana/mariadb_ssl_reload.sh")
destination = "local/mariadb_ssl_reload.sh"
perms = "750"
}
}
}
}
group "app" {
count = 1
network {
port "grafana-http" {
to = 3000
}
}
task "app" {
driver = "docker"
}
...
grafana consumes the DB, getting it's address via Consul.
...
}
}
mariadb_ssl_reload.sh
#!/usr/bin/env bash
mariadb --batch -u $MARIADB_ROOT_USER -p$MARIADB_ROOT_PASSWORD -e "flush ssl" --ssl && \
  openssl x509 -in /secrets/cert.crt -dates -subject -issuer -noout > /local/ssl_reload_out.txt
Logs from Nomad client:
Nomad UI:
Task hook failed | Template failed to run script /local/mariadb_ssl_reload.sh with arguments [] because task driver doesn't support the exec operation
Thanks for the extra info @kkornienko-aparavi. Unfortunately I have not been able to reproduce this yet, even after restarting the Nomad agent.
It happens right after nomad client is restarted
Just to clarify this part, do you mean the Nomad client process is restarted or is it the entire machine, like a server reboot?
I'm reopening this issue just in case until we find a better answer.
hey @lgfa29 😄 I've run into the same issue on 1.6.3 with a script used to update the Java keystore on changes to a certificate rendered from Vault's PKI via Consul template:
template {
  destination = "secrets/kafka.pem"
  change_mode = "script"
  change_script {
    command       = "/local/hot-reload-keystore.sh"
    timeout       = "20s"
    fail_on_error = false
  }
  splay = "0ms"
  data  = <<-EOF
{{- with secret "pki/issue/kafka-brokers" "common_name=xxx" "ttl=10m" "alt_names=*.yyyy" "private_key_format=pkcs8" -}}
{{ .Data.private_key }}
{{ .Data.certificate }}
{{ range .Data.ca_chain }}
{{ . }}
{{ end }}
{{- end -}}
EOF
}
While investigating the issue I was able to reproduce it with a loop updating a key in Consul K/V and restarting the Nomad client while the loop is running.
- Start the job
job "test" {
region = "dc"
datacenters = ["dca"]
constraint {
attribute = node.unique.name
value = "xxx"
}
group "test" {
count = 1
task "test" {
driver = "docker"
config {
image = "alpine:3.18.2"
args = ["tail", "-f", "/dev/null"]
}
template {
data = <<EOF
{{ range ls "services/my-ns/test" }}
{{.Key}}={{.Value}}
{{ end }}
EOF
destination = "local/test.tmpl"
change_mode = "script"
change_script {
command = "/local/change.sh"
}
}
template {
data = <<EOF
#!/bin/sh
date >> local/run.log
EOF
destination = "local/change.sh"
perms = "555"
}
resources {
cores = 4
memory = 64
}
}
}
}
- Start the loop:
while : ; do consul kv put services/my-ns/test/foo a=$RANDOM ; sleep 5 ; done
- Restart the Nomad client:
2023-12-11T14:08:34.829Z [DEBUG] agent: (runner) receiving dependency kv.list(services/my-ns/test)
2023-12-11T14:08:34.829Z [DEBUG] agent: (runner) initiating run
2023-12-11T14:08:34.829Z [DEBUG] agent: (runner) checking template 06ec856a7a6a01f081141dc0b6392a33
2023-12-11T14:08:34.830Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/nomad/data/alloc/409a1017-ed4c-608f-c9a6-d24366f47dc4/test/local/test.tmpl"
2023-12-11T14:08:34.833Z [INFO] agent: (runner) rendered "(dynamic)" => "/nomad/data/alloc/409a1017-ed4c-608f-c9a6-d24366f47dc4/test/local/test.tmpl"
2023-12-11T14:08:34.833Z [DEBUG] agent: (runner) checking template a55d0446878b1ccea6af24e1d973c69e
2023-12-11T14:08:34.834Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/nomad/data/alloc/409a1017-ed4c-608f-c9a6-d24366f47dc4/test/local/change.sh"
2023-12-11T14:08:34.834Z [DEBUG] agent: (runner) diffing and updating dependencies
2023-12-11T14:08:34.834Z [DEBUG] agent: (runner) kv.list(services/my-ns/test) is still needed
2023-12-11T14:08:34.834Z [DEBUG] agent: (runner) watching 1 dependencies
2023-12-11T14:08:34.834Z [DEBUG] agent: (runner) all templates rendered
2023-12-11T14:08:37.219Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=409a1017-ed4c-608f-c9a6-d24366f47dc4 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
Subsequent runs succeed:
2023-12-11T14:08:43.021Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=409a1017-ed4c-608f-c9a6-d24366f47dc4 task=test type="Task hook message" msg="Template successfully ran script /local/change.sh with arguments: []. Exit code: 0" failed=false
Can you please check if you are able to reproduce it as well?
I've reproduced this in my lab as well, using the jobspecs provided by the-nando and lgfa29 above.
In each case the issue was noted only after a client restart.
The logs (nomad, journal & docker) and bundle attached were captured while testing with @the-nando's jobspec and steps.
Nomad version used
1.6.2+ent
Client restart timestamps / nomad log grep
root@nomad-client-1:/etc/docker# date; systemctl restart nomad; grep "Template failed to run script" /opt/nomad/nomad.log
Tue 12 Dec 2023 03:26:07 PM UTC
2023-12-12T15:26:04.381Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
root@nomad-client-1:/etc/docker# date; systemctl restart nomad; grep "Template failed to run script" /opt/nomad/nomad.log
Tue 12 Dec 2023 03:26:34 PM UTC
2023-12-12T15:26:04.381Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
2023-12-12T15:26:20.033Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
root@nomad-client-1:/etc/docker# date; systemctl restart nomad; grep "Template failed to run script" /opt/nomad/nomad.log
Tue 12 Dec 2023 03:27:20 PM UTC
2023-12-12T15:26:04.381Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
2023-12-12T15:26:20.033Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
2023-12-12T15:26:45.017Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
client_docker.log.zip journal_nomad_client.log.zip journal_nomad_server.log.zip nomad_client.log.zip nomad_server.log.zip nomad-debug-2023-12-12-152541Z.tar.gz
Anecdotally (I don't have any of the above dumps currently), this behavior continues in 1.7.5, and it seems to continue to be related to a nomad client restart.
Hi folks, I'm just getting started digging into this. At first glance it's pretty bizarre:
- The error we're getting comes from the Poststart method of the template hook (ref template_hook.go#L203-L212) whenever the poststart request's DriverExec field is nil.
- That request object is created in the taskrunner (ref task_runner_hooks.go#L350-L375) and we can see it's populated by a call to the function NewLazyHandle. NewLazyHandle always returns an object, so there's no way for the field to be set to nil!
While I could easily imagine, given that the bug shows up around client restarts, that we've got some kind of race condition where the handle is not yet valid, that's not what the bug appears to be. My suspicion is that when the lazy handle's refreshHandleLocked method times out, we're setting the resulting nil object as the handle and effectively "cross-contaminating" between hooks. This will take a good bit more debugging, but I'll update once I know more.
Oh, looks like I was looking at the wrong error message. That's embarrassing. :blush: The "Template failed to run script %v with arguments %v because task driver doesn't support the exec operation" message is from the task template manager, not the hook (ref template.go#L588). That's much less puzzling.
In that case, the error isn't happening in the poststart hook but in the prestart hook. But the handle we need to fire the change script isn't set until the poststart hook. So the template renders for the "first" time (according to the task runner lifecycle), immediately detects a change to the file, and that triggers the change script even though we don't have a handle to the task yet.
The very helpful client logs that @ron-savoia posted show the order of events pretty clearly for alloc 05607ec2 (with irrelevant bits elided):
2023-12-12T15:26:03.008Z [TRACE] client.alloc_runner.task_runner: running prestart hook: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test name=template start="2023-12-12 15:26:03.008869715 +0000 UTC m=+10.148452814"
2023-12-12T15:26:03.008Z [INFO] agent: (runner) creating new runner (dry: false, once: false)
... (yadda yadda consul-template stuff) ...
2023-12-12T15:26:03.009Z [DEBUG] agent: (runner) rendering "(dynamic)" => "/opt/nomad/data/alloc/05607ec2-43e9-283c-e652-7b5ed2c4c412/test/local/change.sh"
2023-12-12T15:26:03.009Z [DEBUG] agent: (runner) diffing and updating dependencies
2023-12-12T15:26:03.009Z [DEBUG] agent: (runner) watching 1 dependencies
...
2023-12-12T15:26:03.012Z [DEBUG] agent: (runner) all templates rendered
...
2023-12-12T15:26:04.381Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test type="Task hook failed" msg="Template failed to run script /local/change.sh with arguments [] because task driver doesn't support the exec operation" failed=false
...
2023-12-12T15:26:04.389Z [TRACE] client.alloc_runner.task_runner: finished prestart hooks: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test end="2023-12-12 15:26:04.389997341 +0000 UTC m=+11.529580481" duration=1.389301764s
...
2023-12-12T15:26:04.402Z [TRACE] client.alloc_runner.task_runner: running poststart hook: alloc_id=05607ec2-43e9-283c-e652-7b5ed2c4c412 task=test name=template start="2023-12-12 15:26:04.402213108 +0000 UTC m=+11.541796249"
We're never going to have a valid task handle at the point of the prestart, so we simply can't run the script at that point.
What I'll do next is write a unit test demonstrating the exact behavior we're seeing, and then see if I can figure out a reasonable way to "queue up" the script execution so that if we restore a task handle we can run it afterwards. Without getting stuck or trying to execute if the task handle fails to be restored (e.g. the task gets killed between the time we render and the time we get to the poststop hook method), of course.
(Edit: I thought this sounded familiar, and as it turns out we didn't trigger change mode on templates during restore until I fixed that way back in https://github.com/hashicorp/nomad/pull/9636. So either this is a regression or a bug in that original implementation years ago.)
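For what that "queue it up" idea could look like, here is a rough, hypothetical Go sketch (made-up names, not the change that eventually shipped further down in the thread): change-script requests that fire before a handle exists are parked and replayed once a handle is attached, and dropped if the task is torn down without ever getting one.

// Hypothetical sketch of deferring change-script execution until a task
// driver handle is available; names are illustrative, not Nomad's code.
package main

import (
	"fmt"
	"sync"
)

type execFn func(cmd string, args []string) error

type scriptRunner struct {
	mu      sync.Mutex
	exec    execFn     // nil until the handle is attached (e.g. in poststart)
	pending [][]string // change scripts that fired before the handle existed
	closed  bool       // set if the task will never get a handle (e.g. killed)
}

// Run executes immediately when possible, otherwise queues the request.
func (r *scriptRunner) Run(cmd string, args []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	switch {
	case r.closed:
		// Task will never have a handle; drop instead of blocking forever.
	case r.exec == nil:
		r.pending = append(r.pending, append([]string{cmd}, args...))
	default:
		_ = r.exec(cmd, args)
	}
}

// Attach is called once the driver handle is restored; it replays the queue.
func (r *scriptRunner) Attach(e execFn) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.exec = e
	for _, p := range r.pending {
		_ = e(p[0], p[1:])
	}
	r.pending = nil
}

// Close marks the runner as never getting a handle (task killed before poststart).
func (r *scriptRunner) Close() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.closed = true
	r.pending = nil
}

func main() {
	r := &scriptRunner{}
	r.Run("/local/change.sh", nil) // fires during the prestart render: queued
	r.Attach(func(cmd string, args []string) error {
		fmt.Println("exec", cmd, args)
		return nil
	}) // handle restored: the queued script runs now
}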
At this point reproduction is easy and deterministic when running at least one Nomad server and a separate Nomad client.
Create a variable:
$ nomad var put nomad/jobs/example name=Tim
Created variable "nomad/jobs/example" with modify index 428
Run the following job.
jobspec
job "example" {
group "group" {
network {
mode = "bridge"
port "www" {
to = 8001
}
}
task "task" {
driver = "docker"
config {
image = "busybox:1"
command = "httpd"
args = ["-vv", "-f", "-p", "8001", "-h", "/local"]
ports = ["www"]
}
template {
data = <<EOT
<html>
<h1>hello, {{ with nomadVar "nomad/jobs/example" -}}{{.name}}{{- end -}}</h1>
</html>
EOT
destination = "local/index.html"
change_mode = "script"
change_script {
command = "/bin/sh"
args = ["-c", "echo ok"]
fail_on_error = true
}
}
resources {
cpu = 50
memory = 50
}
}
}
}
Stop the client.
$ sudo systemctl stop nomad
Update the variable.
$ nomad var put -force nomad/jobs/example name=Ron
Created variable "nomad/jobs/example" with modify index 440
Restart the client.
$ sudo systemctl start nomad
See the allocation fail:
$ nomad alloc status
...
Recent Events:
Time Type Description
2024-07-19T16:15:46-04:00 Terminated Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2024-07-19T16:15:41-04:00 Killing Template script failed, task is being killed
2024-07-19T16:15:41-04:00 Task hook failed Template failed to run script /bin/sh with arguments [-c echo ok] because task driver doesn't support the exec operation
2024-07-19T16:15:08-04:00 Started Task started by client
2024-07-19T16:15:08-04:00 Task Setup Building Task Directory
2024-07-19T16:15:08-04:00 Received Task received by client
I've got a fix up here: https://github.com/hashicorp/nomad/pull/23663. The task driver handle was actually already available, but we simply weren't providing it to the template hook in time for this use.
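For illustration only, here is a minimal hypothetical sketch of that idea (not the actual code in the PR): give the template hook a way to ask for the current driver handle at the moment a change script fires, instead of waiting for Poststart to push the handle in.

// Hypothetical sketch (not the actual PR): the hook is built with a handle
// getter, so a script fired during prestart can still reach the driver once
// the task runner has a handle (e.g. after a client restart restores it).
package main

import (
	"errors"
	"fmt"
)

type execHandle func(cmd string, args []string) error

type templateHook struct {
	// getHandle is supplied at construction; it returns nil until the task
	// runner actually has a driver handle.
	getHandle func() execHandle
}

func (h *templateHook) onTemplateChange(cmd string, args []string) error {
	exec := h.getHandle()
	if exec == nil {
		return errors.New("task driver doesn't support the exec operation")
	}
	return exec(cmd, args)
}

func main() {
	var handle execHandle // task runner's current handle; nil before the task is running

	hook := &templateHook{getHandle: func() execHandle { return handle }}

	fmt.Println(hook.onTemplateChange("/local/change.sh", nil)) // too early: error

	// Once the task is running (or restored after a client restart), the task
	// runner sets its handle and the same hook call succeeds.
	handle = func(cmd string, args []string) error { return nil }
	fmt.Println(hook.onTemplateChange("/local/change.sh", nil)) // <nil>
}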
We are using Nomad version v1.8.2+ent. The template block looks as follows:
template {
  destination = "secrets/secret.txt"
  change_mode = "script"
  change_script {
    command = "/bin/sh"
    args    = ["sleep", "180", "&&", "touch", "/local/test"]
    timeout = "90s"
  }
  data = <<EOT
{{- with secret "dynamic/test-template" }}{{ .Data.data.foo }}{{- end }}
EOT
}
However, when the secret in Vault is updated, the post hook fails with "Template ran script /bin/sh with arguments [sleep 5 && touch /local/test] on change but it exited with code code: 2". There are no client restarts prior to this activity.
@aisvarya2 that's a different issue. In your case, the change script is running but you're getting an error. That's because you're passing invalid arguments to /bin/sh -- you can't pass what you'd type in the shell as arguments to /bin/sh. I believe you want something like:
change_script {
  command = "/bin/sh"
  args    = ["-c", "'sleep 180 && touch /local/test'"]
  timeout = "90s"
}
Also, your timeout is less than the sleep window, but I suspect you've got some redaction going on here.
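To make the argument-handling point concrete, here is a small standalone Go sketch using os/exec (not Nomad code, but exec-style argument arrays behave the same way): items in args reach the program verbatim, so shell operators such as && only work when the whole command line is handed to the shell as the single string after -c. Note that the -c payload in the sketch is one plain string without extra quoting.

// Standalone illustration (not Nomad code) of why a change_script args array
// is not parsed like a shell command line.
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Like args = ["sleep", "180", "&&", "touch", "/local/test"]: if run,
	// sh would treat "sleep" as a script file to open and the remaining
	// items as positional parameters -- nothing sleeps, nothing is touched.
	wrong := exec.Command("/bin/sh", "sleep", "180", "&&", "touch", "/local/test")
	fmt.Println(wrong.Args)

	// Like args = ["-c", "sleep 180 && touch /local/test"]: the single
	// string after -c is parsed by the shell itself, so && works as intended.
	right := exec.Command("/bin/sh", "-c", "sleep 180 && touch /local/test")
	fmt.Println(right.Args)
}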
@tgross we tried using the change_script as you suggested, with an appropriate timeout, but it does not work as expected when we pass the command line via args. It works as expected when we use command to run a shell script directly, but not when the script is passed in args:
template {
  destination = "secrets/secret.txt"
  change_mode = "script"
  change_script {
    command = "local/change.sh"
    timeout = "90s"
  }
  data = <<EOT
{{- with secret "dynamic/test-template" }}{{ .Data.data.foo }}{{- end }}
EOT
}

template {
  data = <<EOF
#!/bin/sh
sleep 60
touch /local/test
EOF
  destination = "local/change.sh"
  perms       = "555"
}
@aisvarya2 you'll want to look at some trace-level logging to see what's going on with the script then. Getting exit code 2 means there's something wrong with the script itself, not the change_mode block. If you're unable to debug that, please open a new issue as what you're seeing is not at all related to this one.
@aisvarya2 I suppose it could be something wrong with permissions: 555 is different from 0555. Also, use <<-EOF so you can indent your code properly. You can try running an empty script first to test change_mode:
#!/bin/sh
echo "Ok"
Or
#!/bin/bash
echo "Ok"
Add more permissions to the script while debugging, e.g. 0777 (do not forget to remove them afterwards).
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.