Template rendering through nomad job failing on windows nodes

Open pavanrangain opened this issue 4 months ago • 7 comments

Nomad version

1.6.7

Operating system and Environment details

Windows Server 2019

Issue

Template rendering in a Nomad job fails on Windows nodes since 1.6.7 (the issue is also seen in 1.6.8). The issue was not present in version 1.6.6.

Reproduction steps

  1. Run a Nomad job with a template that renders to a file, reading its values from Consul KV or falling back to a default from the file system

Expected Result

Job should get deployed successfully

Actual Result

The job fails with an error like the following:

Template failed: error rendering "(dynamic)" => "<path removed>//log_config": template render subprocess failed: exit status 0xc0000142

NOTE: the actual path has been removed from the error.

Job file (if appropriate)

job "test-windoes-job" {
  region      = "global"
  datacenters = ["my-test-cluster"]
  type        = "system"

  group "test-winodows-group" {
    count = 1

    constraint {
      attribute = "${attr.kernel.name}"
      value     = "windows"
    }
  
    task "test-windows-task" {
      driver = "raw_exec"

      artifact {
        source = "<some artifactory url>/filebeat.exe"
      }

      env {
        LOGZIO_CODEC          = "json"
        IP_ADDRESS            = "${attr.unique.network.ip-address}"
        HOSTNAME              = "${attr.unique.hostname}"
        CLUSTER_NAME          = "my-test-cluster"     
      }

      config {
        command = "filebeat.exe"
        args    = ["-c", "local/log_config"]
      }

      template {
        data = <<EOH
          ############################# Filebeat #####################################

          filebeat.inputs:

          - type: log
            paths:
              - ${NOMAD_ALLOC_DIR}/logs/*
            fields:
              logzio_codec: ${LOGZIO_CODEC:'json'}
              token: ${LOGZIO_TOKEN:''}
              clusterName: ${CLUSTER_NAME:''}
            fields_under_root: true
            encoding: utf-8
            ignore_older: 24h
            tail_files: true
            exclude_lines: ${excludeLines:[]}

          #The following processors are to ensure compatibility with version 7
          processors:
          - rename:
              fields:
              - from: "agent"
                to: "beat_agent"
              ignore_missing: true
          - rename:
              fields:
              - from: "log.file.path"
                to: "source"
              ignore_missing: true

          ############################# Output ##########################################
          output:
            logstash:
              enabled: false
              hosts: ["listener.logz.io:5015}"]
        EOH

        destination = "local/log_config"
        change_mode = "restart"
      }

      resources {
        cpu    = 100 # Mhz
        memory = 100 # MB
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nothing relevant

Nomad Client logs (if appropriate)

The client logs just show the same error: Template failed: error rendering "(dynamic)" => "<path removed>//log_config": template render subprocess failed: exit status 0xc0000142

Observation:

The issue may be caused by this change that went into 1.6.7. We see no template rendering issue on Linux nodes; the issue occurs only on Windows nodes.

pavanrangain avatar Feb 26 '24 12:02 pavanrangain

I'm experiencing a similar problem. Template rendering fails on the following template:

template {
  data        = <<EOH
    USERNAME="{{ with secret "secret/path" }}{{ .Data.data.username }}{{ end }}"
    PASSWORD="{{ with secret "secret/path" }}{{ .Data.data.password }}{{ end }}"
  EOH
  destination = "secrets/vault.cred"
  env         = true
}

I didn't have any issues with this template on 1.7.1, but after upgrading to 1.7.5 I started to get the template render subprocess failed: exit status 0xc0000142 error.

meowtini avatar Feb 26 '24 14:02 meowtini

I've experienced the same issue in both 1.7.5 and 1.7.4. 1.7.2 is working.

hardselius avatar Feb 27 '24 11:02 hardselius

Hi @pavanrangain, @meowtini, and @hardselius; thanks for raising and contributing to this issue. I believe this is caused by the changes introduced within https://github.com/hashicorp/nomad/issues/19888, so I would like to ask for some additional information to help us understand the problem, which our testing missed.

  • client logs from when the task is placed until after the template render fails
  • details of the Nomad client binary permissions
  • details on the user that is being used to run the Nomad client binary
  • permission details on the template destination as well as the parent path

Thanks.

jrasell avatar Feb 27 '24 11:02 jrasell

@jrasell Please find the information you asked for:

  • client logs from when the task is placed until after the template render fails: nomad-windows-agent-logs.txt

  • details of the Nomad client binary permissions: nomad_file_permission

  • details on the user that is being used to run the Nomad client binary: it is the Windows SYSTEM user

  • permission details on the template destination as well as the parent path: the SYSTEM user has full permission (nomad_folder_permissions)

Again, this is seen only on Windows Server nodes and not on Linux nodes (at least in our case).

pavanrangain avatar Feb 28 '24 10:02 pavanrangain

Again, this is seen only on Windows Server nodes and not on Linux nodes (at least in our case).

Yeah, the security update in 1.6.7 has a significantly different implementation on Windows than on any other operating system. We had to implement AppContainers rather than just chrooting the rendering subprocess.

Unfortunately it looks like the client logs you've provided here are at info-level only, so we may be missing some context. Here are the only relevant bits:

{"@level":"info","@message":"(runner) creating new runner (dry: false, once: false)","@module":"agent","@timestamp":"2024-02-28T04:59:42.152055-05:00"}
{"@level":"info","@message":"(runner) creating watcher","@module":"agent","@timestamp":"2024-02-28T04:59:42.153260-05:00"}
{"@level":"info","@message":"(runner) starting","@module":"agent","@timestamp":"2024-02-28T04:59:42.153828-05:00"}
{"@level":"error","@message":"exit status 0xc0000142","@module":"agent","@timestamp":"2024-02-28T04:59:42.204358-05:00"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-02-28T04:59:42.204358-05:00","alloc_id":"de978c15-f7fe-ec5a-c386-a205a79ec2d5","failed":true,"msg":"Template failed: error rendering \"(dynamic)\" =\u003e \"C:\\\\ProgramData\\\\nomad\\\\alloc\\\\de978c15-f7fe-ec5a-c386-a205a79ec2d5\\\\control-plane-logging-task\\\\local\\\\log_config\": template render subprocess failed: exit status 0xc0000142","task":"control-plane-logging-task","type":"Killing"}
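For anyone else sifting through JSON-formatted client logs for this error, here is a rough sketch of how the relevant lines could be pulled out. It is not part of Nomad; the function name `relevant_entries` and the filter conditions are assumptions based only on the field names visible in the excerpt above (`@level`, `@message`, `msg`).

```python
import json


def relevant_entries(lines):
    """Yield parsed log entries that are errors, come from the template
    runner, or report a template failure. Field names are taken from the
    log excerpt in this thread; the filter itself is an assumption."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        message = entry.get("@message", "")
        if (
            entry.get("@level") == "error"
            or "(runner)" in message
            or "Template failed" in entry.get("msg", "")
        ):
            yield entry


# Two lines copied from the excerpt above, plus one made-up unrelated line.
sample = [
    '{"@level":"info","@message":"(runner) starting","@module":"agent","@timestamp":"2024-02-28T04:59:42.153828-05:00"}',
    '{"@level":"error","@message":"exit status 0xc0000142","@module":"agent","@timestamp":"2024-02-28T04:59:42.204358-05:00"}',
    '{"@level":"info","@message":"hypothetical unrelated message","@module":"agent"}',
]

for entry in relevant_entries(sample):
    print(entry["@timestamp"], entry["@level"], entry["@message"])
```

To run it against a real log file, pass the open file object instead of `sample`, since iterating a file yields its lines.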

According to the Microsoft error reference documentation (PDF), the exit code we're getting here is STATUS_DLL_INIT_FAILED, which implies we weren't able to open the rendering process because we couldn't open a DLL somewhere. This doesn't make a whole lot of sense, as the rendering process doesn't load any external DLLs. So I'm still investigating how we could be hitting this error.
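As a quick illustration of how that exit status is read, here is a minimal sketch that decodes the hex value as an NTSTATUS code. The lookup table is deliberately tiny and hand-picked, and the helper name `describe_exit_status` is made up for this example. The severity comes from the top two bits of the 32-bit value, per the NTSTATUS layout.

```python
# A small, hand-picked subset of NTSTATUS values; not an exhaustive table.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000022: "STATUS_ACCESS_DENIED",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000142: "STATUS_DLL_INIT_FAILED",
}

# The two most significant bits of an NTSTATUS value encode its severity.
SEVERITIES = ("success", "informational", "warning", "error")


def describe_exit_status(text):
    """Turn a hex exit status such as '0xc0000142' into a readable string."""
    code = int(text, 16)
    severity = SEVERITIES[(code >> 30) & 0x3]
    name = KNOWN_NTSTATUS.get(code, "unknown")
    return f"0x{code:08X} {name} (severity: {severity})"


print(describe_exit_status("0xc0000142"))
# prints: 0xC0000142 STATUS_DLL_INIT_FAILED (severity: error)
```

Codes missing from the table are reported as "unknown" rather than guessed, so the full Microsoft reference is still the place to look them up.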

It shouldn't make a difference, but just to help me eliminate possibilities, are you running Nomad as a Windows Service using the instructions in Register Nomad with Windows, or some other way?

tgross avatar Mar 01 '24 15:03 tgross

Please find the trace logs: trace_log_windows_client.json. Nomad is registered as a native Windows service (nomad service properties).

pavanrangain avatar Mar 05 '24 08:03 pavanrangain

Thanks @pavanrangain! Sorry to get back so slowly on this. I just wanted to pop in and say we haven't forgotten you but I've been swamped with a couple of other items. We're going to hand this issue off to @angrycub who helped me work on the security fix that's at the heart of this, and he'll start looking at this once he's wrapped up his current task. Thanks for your patience!

tgross avatar Mar 15 '24 14:03 tgross

Any progress on this? We're still stuck on 1.7.3 in order to avoid this issue.

DTTerastar avatar Apr 16 '24 18:04 DTTerastar

Hi @DTTerastar! We have a solid idea of the problem, which is a difference in ambient credentials between running Nomad as a Windows service vs otherwise (which is what all our tests did! :facepalm:). It's taking longer than we'd like to figure out the solution however. We'll update this issue when we have more information.

tgross avatar Apr 16 '24 18:04 tgross