Template rendering through nomad job failing on windows nodes

Open pavanrangain opened this issue 1 year ago • 7 comments

Nomad version

1.6.7

Operating system and Environment details

Windows Server 2019

Issue

Template rendering in a Nomad job fails on Windows nodes since 1.6.7 (the issue is also seen in 1.6.8). The issue was not present in version 1.6.6.

Reproduction steps

  1. Create a Nomad job with a template that renders to a file, reading from Consul KV or using a default from the file system

Expected Result

Job should get deployed successfully

Actual Result

The job fails with an error like the following:

Template failed: error rendering "(dynamic)" => "<path removed>//log_config": template render subprocess failed: exit status 0xc0000142

NOTE: the actual path has been removed from the error.

Job file (if appropriate)

job "test-windoes-job" {
  region      = "global"
  datacenters = ["my-test-cluster"]
  type        = "system"

  group "test-winodows-group" {
    count = 1

    constraint {
      attribute = "${attr.kernel.name}"
      value     = "windows"
    }
  
  task "test-windows-task" {
      driver = "raw_exec"

      artifact {
        source = "<some artifactory url>/filebeat.exe"
      }

      env {
        LOGZIO_CODEC          = "json"
        IP_ADDRESS            = "${attr.unique.network.ip-address}"
        HOSTNAME              = "${attr.unique.hostname}"
        CLUSTER_NAME          = "my-test-cluster"     
      }

      config {
        command = "filebeat.exe"
        args    = ["-c", "local/log_config"]
      }

      template {
        data = <<EOH
          ############################# Filebeat #####################################

          filebeat.inputs:

          - type: log
            paths:
              - ${NOMAD_ALLOC_DIR}/logs/*
            fields:
              logzio_codec: ${LOGZIO_CODEC:'json'}
              token: ${LOGZIO_TOKEN:''}
              clusterName: ${CLUSTER_NAME:''}
            fields_under_root: true
            encoding: utf-8
            ignore_older: 24h
            tail_files: true
            exclude_lines: ${excludeLines:[]}

          #The following processors are to ensure compatibility with version 7
          processors:
          - rename:
              fields:
              - from: "agent"
                to: "beat_agent"
              ignore_missing: true
          - rename:
              fields:
              - from: "log.file.path"
                to: "source"
              ignore_missing: true

          ############################# Output ##########################################
          output:
            logstash:
              enabled: false
              hosts: ["listener.logz.io:5015}"]
        EOH

        destination = "local/log_config"
        change_mode = "restart"
      }

      resources {
        cpu    = 100 # Mhz
        memory = 100 # MB
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nothing relevant

Nomad Client logs (if appropriate)

The client logs just show the same error: Template failed: error rendering "(dynamic)" => "<path removed>//log_config": template render subprocess failed: exit status 0xc0000142

Observation:

The issue may be caused by this change that went into 1.6.7. No template rendering issue is seen on Linux nodes; the issue occurs only on Windows nodes.

pavanrangain avatar Feb 26 '24 12:02 pavanrangain

I'm experiencing a similar problem. Template rendering fails on the following template:

template {
  data        = <<EOH
    USERNAME="{{ with secret "secret/path" }}{{ .Data.data.username }}{{ end }}"
    PASSWORD="{{ with secret "secret/path" }}{{ .Data.data.password }}{{ end }}"
  EOH
  destination = "secrets/vault.cred"
  env         = true
}

I didn't have any issues with this template on 1.7.1, but after upgrading to 1.7.5 I started to get the template render subprocess failed: exit status 0xc0000142 error.

meowtini avatar Feb 26 '24 14:02 meowtini

I've experienced the same issue in both 1.7.5 and 1.7.4. 1.7.2 is working.

hardselius avatar Feb 27 '24 11:02 hardselius

Hi @pavanrangain, @meowtini, and @hardselius; thanks for raising and contributing to this issue. I believe this is caused by the changes introduced within https://github.com/hashicorp/nomad/issues/19888, so I would like to ask for some additional information to help us understand the problem that our testing missed.

  • client logs from when the task is placed until after the template render fails
  • details of the Nomad client binary permissions
  • details on the user that is being used to run the Nomad client binary
  • permission details on the template destination as well as the parent path

Thanks.

jrasell avatar Feb 27 '24 11:02 jrasell

@jrasell Please find the info you asked for:

  • client logs from when the task is placed until after the template render fails: nomad-windows-agent-logs.txt

  • details of the Nomad client binary permissions: nomad_file_permission

  • details on the user that is being used to run the Nomad client binary: it is the Windows SYSTEM user

  • permission details on the template destination as well as the parent path: the SYSTEM user has full permission (nomad_folder_permissions)

Again, this is seen only on Windows Server nodes and not on Linux nodes (at least in our case).

pavanrangain avatar Feb 28 '24 10:02 pavanrangain

Again, this is seen only on Windows Server nodes and not on Linux nodes (at least in our case).

Yeah, the security update in 1.6.7 has a significantly different implementation on Windows than on any other operating system. We had to implement AppContainers rather than just chrooting the rendering subprocess.

Unfortunately it looks like the client logs you've provided here are at info-level only, so we may be missing some context. Here are the only relevant bits:

{"@level":"info","@message":"(runner) creating new runner (dry: false, once: false)","@module":"agent","@timestamp":"2024-02-28T04:59:42.152055-05:00"}
{"@level":"info","@message":"(runner) creating watcher","@module":"agent","@timestamp":"2024-02-28T04:59:42.153260-05:00"}
{"@level":"info","@message":"(runner) starting","@module":"agent","@timestamp":"2024-02-28T04:59:42.153828-05:00"}
{"@level":"error","@message":"exit status 0xc0000142","@module":"agent","@timestamp":"2024-02-28T04:59:42.204358-05:00"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-02-28T04:59:42.204358-05:00","alloc_id":"de978c15-f7fe-ec5a-c386-a205a79ec2d5","failed":true,"msg":"Template failed: error rendering \"(dynamic)\" =\u003e \"C:\\\\ProgramData\\\\nomad\\\\alloc\\\\de978c15-f7fe-ec5a-c386-a205a79ec2d5\\\\control-plane-logging-task\\\\local\\\\log_config\": template render subprocess failed: exit status 0xc0000142","task":"control-plane-logging-task","type":"Killing"}
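A minimal sketch of how entries like these could be pulled out of a JSON-formatted agent log, assuming one JSON object per line with the `@level` and `failed` fields seen in the excerpt above. The function name `error_entries` is hypothetical and not part of Nomad.

```python
import json


def error_entries(lines):
    """Return log entries that are error-level or flagged as failed.

    Assumes one JSON object per line, using the "@level" and "failed"
    keys that appear in the excerpt above. Lines that are blank or not
    valid JSON objects are skipped.
    """
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        if entry.get("@level") == "error" or entry.get("failed") is True:
            matches.append(entry)
    return matches


# Two lines copied from the excerpt above.
sample = [
    r'{"@level":"info","@message":"(runner) starting","@module":"agent","@timestamp":"2024-02-28T04:59:42.153828-05:00"}',
    r'{"@level":"error","@message":"exit status 0xc0000142","@module":"agent","@timestamp":"2024-02-28T04:59:42.204358-05:00"}',
]

for entry in error_entries(sample):
    print(entry["@timestamp"], entry["@message"])
```

Filtering on both `@level` and `failed` catches the failed "Task event" line as well, since that line is logged at info level.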

According to the MSFT error reference documentation (PDF), the exit code we're getting here is STATUS_DLL_INIT_FAILED, which implies we weren't able to open the rendering process because we couldn't open a DLL somewhere. This doesn't make a whole lot of sense, as the rendering process doesn't load any external DLLs. So I'm still investigating how we could be hitting this error.
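To illustrate how an exit status such as 0xc0000142 can be read, here is a small sketch that splits a 32-bit NTSTATUS value into its severity, facility, and code fields. The lookup table is a tiny hand-written subset for illustration, and the helper names `decode_ntstatus` and `status_from_message` are hypothetical.

```python
import re

# Hand-written subset of NTSTATUS names, for illustration only.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000142: "STATUS_DLL_INIT_FAILED",
}

# The top two bits of an NTSTATUS value encode its severity.
SEVERITY = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def decode_ntstatus(code):
    """Split a 32-bit NTSTATUS value into its bit fields."""
    code &= 0xFFFFFFFF
    return {
        "hex": f"0x{code:08x}",
        "severity": SEVERITY[code >> 30],
        "customer": bool((code >> 29) & 1),  # set for non-Microsoft codes
        "facility": (code >> 16) & 0x0FFF,
        "code": code & 0xFFFF,
        "name": KNOWN_STATUS.get(code, "unknown"),
    }


def status_from_message(message):
    """Extract the hex exit status from an 'exit status 0x...' message."""
    match = re.search(r"exit status (0x[0-9a-fA-F]+)", message)
    return int(match.group(1), 16) if match else None


msg = "template render subprocess failed: exit status 0xc0000142"
print(decode_ntstatus(status_from_message(msg)))
```

The severity field being "error" with facility 0 marks this as a generic system-level failure rather than one from a specific subsystem.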

It shouldn't make a difference, but just to help me eliminate possibilities, are you running Nomad as a Windows Service using the instructions in Register Nomad with Windows, or some other way?

tgross avatar Mar 01 '24 15:03 tgross

Please find the trace logs attached: trace_log_windows_client.json. Nomad is registered as a native Windows service (attached: nomad service properties).

pavanrangain avatar Mar 05 '24 08:03 pavanrangain

Thanks @pavanrangain! Sorry to get back so slowly on this. I just wanted to pop in and say we haven't forgotten you but I've been swamped with a couple of other items. We're going to hand this issue off to @angrycub who helped me work on the security fix that's at the heart of this, and he'll start looking at this once he's wrapped up his current task. Thanks for your patience!

tgross avatar Mar 15 '24 14:03 tgross

Any progress on this? We're still stuck on 1.7.3 in order to avoid this issue.

DTTerastar avatar Apr 16 '24 18:04 DTTerastar

Hi @DTTerastar! We have a solid idea of the problem, which is a difference in ambient credentials between running Nomad as a Windows service vs otherwise (which is what all our tests did! :facepalm:). It's taking longer than we'd like to figure out the solution however. We'll update this issue when we have more information.

tgross avatar Apr 16 '24 18:04 tgross

I am having the same issue on Ubuntu for many jobs.

Template failed: error rendering "(dynamic)" => "/etc/nomad.d/data/alloc/44a00841-7dc1-bef9-f3a4-98f456be2d8f/a1beb0eb-1d27-41b6-9324-3a5f15642a25/local/.env": template render subprocess failed: signal: killed

mikedvinci90 avatar May 05 '24 12:05 mikedvinci90

@mikedvinci90 if your issue isn't on Windows, please open a new issue for that. The isolation mechanism is very different between the two operating systems. Debugging this is likely possible on Linux without the patch we're working on (slowly!) for Windows.

tgross avatar May 06 '24 12:05 tgross

Hi @tgross. I'm facing a similar issue. I'd really appreciate it if you could help. Thanks.

Nomad version

1.7.7

Operating system and Environment details

Windows 11 Home OS build: 22621.3447

Issue:

Template rendering works fine when the Nomad binary is run from PowerShell (Administrator mode), but it fails when Nomad runs as a Windows service.

In the web UI, I saw the following error message when running Nomad as a Windows service: "Task hook failed: template: failed to read template: exit status 0xc0000142" (attached: nomad.log)

But the template file is actually there.

Reproduction steps

Use "sc.exe create ..." to create the Windows service, with "Local System" as the user it runs as. Then run a simple job with a "template" block.

My server configuration

datacenter = "dc0"
name = "nomad-on-win11"

data_dir  = "D:\\hashicorp\\nomad\\data"
log_file = "D:\\hashicorp\\nomad\\log\\nomad.log"
log_level = "DEBUG"
bind_addr = "0.0.0.0"

server {
  # license_path is required for Nomad Enterprise as of Nomad v1.1.1+
  #license_path = "/etc/nomad.d/license.hclic"
  enabled          = true
  bootstrap_expect = 1

  # This is the IP address of the first server provisioned
  server_join {
    # nslookup "$(hostname).local"
    retry_join = ["127.0.0.1:4648"]
    retry_max = 3
    retry_interval = "15s"
  }
}

client {
  enabled = true
  servers = ["127.0.0.1"]
  # use command to find the interface name "netsh int ipv4 show interfaces"
  network_interface = "Loopback Pseudo-Interface 1"
}

plugin "raw_exec" {
    config {
      enabled = true
    }
}

My Job

job "fo-component" {


  group "example" {

    task "service-task" {
      artifact {
        source      = "https://github.com/thfai2000/jenkins-pipelines/releases/download/1.0/artifact-1.0.zip"
        destination = "local/app"
      }

      template {
        source        = "local/app/config.xml.tpl"
        destination   = "local/app/config.xml"
      }


      driver = "raw_exec"
      config {
        command = "local/app/bin/Release/net8.0/win-x64/.net.exe"
      }

    }

  }

}

The Windows Service Properties

The executable: D:\hashicorp\nomad\bin\nomad.exe agent -config=D:\hashicorp\nomad\config\nomad.hcl

thfai2000 avatar May 12 '24 10:05 thfai2000

@thfai2000 for now the only solution is to disable the file sandbox: https://developer.hashicorp.com/nomad/docs/configuration/client#disable_file_sandbox This sounds much worse than it really is, as you're already using raw_exec and the task itself can bypass the sandbox. We're still working on a better long-term solution, including engaging with our partners at the OS vendor.

tgross avatar May 13 '24 13:05 tgross

Hi @tgross,

Thanks for your advice. It works now after I set "disable_file_sandbox = true" in my server configuration file. It's good to hear that your team is trying to figure out a better solution, and I really appreciate your team's effort. Thanks.

client {
  enabled = true
  template {
    disable_file_sandbox = true
  }
  servers = ["127.0.0.1"]
  # use command to find the interface name "netsh int ipv4 show interfaces"
  network_interface = "Loopback Pseudo-Interface 1"
}

plugin "raw_exec" {
    config {
      enabled = true
    }
}

thfai2000 avatar May 15 '24 10:05 thfai2000

Not the same issue but deeply interrelated: https://github.com/hashicorp/nomad/issues/20585

tgross avatar May 17 '24 20:05 tgross

Disabling the file sandbox also worked for us, but we would like to see a proper fix for this.

gscho avatar May 18 '24 02:05 gscho

Hi @pavanrangain, we just merged 2 changes that will remedy this problem. Nomad 1.8.2 will no longer sandbox template rendering on Windows, and to address the security aspect (which is only relevant for running Docker with Process Isolation as ContainerAdmin) it will perform checks in the Docker driver. I will close the issue for now, feel free to re-open if the problem persists.

pkazmierczak avatar Jun 28 '24 15:06 pkazmierczak
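The version reports scattered through this thread can be summarised with a rough check like the one below. This is only a sketch built from what commenters reported: failures on 1.6.7, 1.6.8, 1.7.4, 1.7.5, and 1.7.7; no failures on 1.6.6, 1.7.1, 1.7.2, and 1.7.3; and the fix announced for 1.8.2. Treating every release between those points as affected is an assumption, and the function name `reported_affected` is hypothetical.

```python
def parse_version(text):
    """Turn a plain 'major.minor.patch' string into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def reported_affected(version):
    """Guess whether a Nomad version on Windows falls in a range where
    commenters in this thread reported the template rendering failure.

    Assumption: releases between the reported versions behave the same
    way as their neighbours. Versions outside these ranges are simply
    not covered by the thread, which is not proof that they are safe.
    """
    v = parse_version(version)
    if (1, 6, 7) <= v < (1, 7, 0):
        return True
    if (1, 7, 4) <= v < (1, 8, 2):
        return True
    return False


for candidate in ["1.6.6", "1.6.7", "1.7.3", "1.7.7", "1.8.2"]:
    print(candidate, reported_affected(candidate))
```

Comparing tuples of integers avoids the string-comparison pitfall where "1.10.0" would sort before "1.9.0".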