Template rendering through Nomad job failing on Windows nodes
Nomad version
1.6.7
Operating system and Environment details
Windows Server 2019
Issue
Template rendering in a Nomad job fails on Windows nodes since 1.6.7 (the issue is also seen in 1.6.8). The issue was not present in version 1.6.6.
Reproduction steps
- Run a Nomad job with a template that renders to a file, reading either from Consul KV or from a default on the file system
Expected Result
Job should get deployed successfully
Actual Result
The job fails with an error like the following:
Template failed: error rendering "(dynamic)" => "<path removed>//log_config": template render subprocess failed: exit status 0xc0000142
NOTE - actual path removed from error
Job file (if appropriate)
job "test-windoes-job" {
  region = "global"
  datacenters = ["my-test-cluster"]
  type = "system"
  group "test-winodows-group" {
    count = 1
    constraint {
      attribute = "${attr.kernel.name}"
      value = "windows"
    }
    task "test-windows-task" {
      driver = "raw_exec"
      artifact {
        source = "<some artifactory url>/filebeat.exe"
      }
      env {
        LOGZIO_CODEC = "json"
        IP_ADDRESS = "${attr.unique.network.ip-address}"
        HOSTNAME = "${attr.unique.hostname}"
        CLUSTER_NAME = "my-test-cluster"
      }
      config {
        command = "filebeat.exe"
        args = ["-c", "local/log_config"]
      }
      template {
        data = <<EOH
############################# Filebeat #####################################
filebeat.inputs:
- type: log
  paths:
    - ${NOMAD_ALLOC_DIR}/logs/*
  fields:
    logzio_codec: ${LOGZIO_CODEC:'json'}
    token: ${LOGZIO_TOKEN:''}
    clusterName: ${CLUSTER_NAME:''}
  fields_under_root: true
  encoding: utf-8
  ignore_older: 24h
  tail_files: true
  exclude_lines: ${excludeLines:[]}
#The following processors are to ensure compatibility with version 7
processors:
- rename:
    fields:
      - from: "agent"
        to: "beat_agent"
    ignore_missing: true
- rename:
    fields:
      - from: "log.file.path"
        to: "source"
    ignore_missing: true
############################# Output ##########################################
output:
  logstash:
    enabled: false
    hosts: ["listener.logz.io:5015}"]
EOH
        destination = "local/log_config"
        change_mode = "restart"
      }
      resources {
        cpu = 100 # Mhz
        memory = 100 # MB
      }
    }
  }
}
Nomad Server logs (if appropriate)
Nothing relevant
Nomad Client logs (if appropriate)
Just shows same error
Template failed: error rendering "(dynamic)" => "<path removed>//log_config": template render subprocess failed: exit status 0xc0000142
Observation:
The issue may be caused by this change that went into 1.6.7. Template rendering works on Linux nodes; the failure occurs only on Windows nodes.
I'm experiencing a similar problem. Template rendering fails on the following template:
template {
  data = <<EOH
USERNAME="{{ with secret "secret/path" }}{{ .Data.data.username }}{{ end }}"
PASSWORD="{{ with secret "secret/path" }}{{ .Data.data.password }}{{ end }}"
EOH
  destination = "secrets/vault.cred"
  env = true
}
I didn't have any issues with this template on 1.7.1, but after upgrading to 1.7.5 I started to get the template render subprocess failed: exit status 0xc0000142 error.
I've experienced the same issue in both 1.7.5 and 1.7.4. 1.7.2 is working.
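To keep track of which releases are affected, the reports so far can be tabulated. The following is only a sketch: the `REPORTS` table is transcribed from the comments above, and the helper names are made up for illustration.

```python
# Reports gathered from the comments above: True means templates rendered,
# False means the render subprocess failed with exit status 0xc0000142.
REPORTS = {
    "1.6.6": True,
    "1.6.7": False,
    "1.6.8": False,
    "1.7.1": True,
    "1.7.2": True,
    "1.7.4": False,
    "1.7.5": False,
}


def parse_version(version):
    """Turn '1.7.4' into (1, 7, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(reports):
    """Per release series, return (last working, first failing) versions."""
    series = {}
    for version, works in reports.items():
        major, minor, _patch = parse_version(version)
        bucket = series.setdefault(f"{major}.{minor}", {True: [], False: []})
        bucket[works].append(version)

    windows = {}
    for name, bucket in sorted(series.items()):
        last_ok = max(bucket[True], key=parse_version, default=None)
        first_bad = min(bucket[False], key=parse_version, default=None)
        windows[name] = (last_ok, first_bad)
    return windows


if __name__ == "__main__":
    for name, (last_ok, first_bad) in regression_window(REPORTS).items():
        print(f"{name}: last working {last_ok}, first failing {first_bad}")
```

Nobody has reported on 1.7.3 at this point in the thread, so for the 1.7 series the reports only narrow the regression to 1.7.3 or 1.7.4.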
Hi @meowtini @hardselius and @hardselius; thanks for raising and contributing to this issue. I believe this is caused by the changes introduced within https://github.com/hashicorp/nomad/issues/19888 and therefore I would ask for some additional information to help us to understand the problem which our testing missed.
- client logs from when the task is placed until after the template render fails
- details of the Nomad client binary permissions
- details on the user that is being used to run the Nomad client binary
- permission details on the template destination as well as the parent path
Thanks.
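For anyone gathering the permission and user details requested above, here is a minimal sketch that shells out to the built-in Windows tools `icacls` (prints a path's ACL) and `sc.exe qc` (prints a service's configuration, including the account it runs as). The service name `nomad` and the binary path are assumptions; substitute the values from your own install.

```python
import ntpath
import subprocess

# Hypothetical defaults, not confirmed anywhere in this thread: adjust them.
SERVICE_NAME = "nomad"
NOMAD_BINARY = r"C:\nomad\nomad.exe"
# Example destination directory; use the path from your own error message.
TEMPLATE_DEST = r"C:\ProgramData\nomad\alloc"


def diagnostic_commands(binary, dest, service):
    """Build the argv lists for each piece of requested information."""
    return {
        "binary_acl": ["icacls", binary],
        "destination_acl": ["icacls", dest],
        "parent_acl": ["icacls", ntpath.dirname(dest)],
        "service_account": ["sc.exe", "qc", service],
    }


def collect(commands):
    """Run each command and keep its output (or the reason it could not run)."""
    report = {}
    for label, argv in commands.items():
        try:
            done = subprocess.run(argv, capture_output=True, text=True, check=False)
            report[label] = done.stdout or done.stderr
        except OSError as exc:  # e.g. the tool is missing on this machine
            report[label] = f"could not run {argv[0]}: {exc}"
    return report


if __name__ == "__main__":
    commands = diagnostic_commands(NOMAD_BINARY, TEMPLATE_DEST, SERVICE_NAME)
    for label, output in collect(commands).items():
        print(f"== {label} ==\n{output}")
```

In the `sc.exe qc` output, the `SERVICE_START_NAME` field is the account the service runs under.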
@jrasell Please find the information you asked for:
- client logs from when the task is placed until after the template render fails: nomad-windows-agent-logs.txt
- details of the Nomad client binary permissions
- details on the user that is being used to run the Nomad client binary: it is the Windows SYSTEM user
- permission details on the template destination as well as the parent path: the SYSTEM user has full permission
Again, this is seen only on Windows Server nodes and not on Linux nodes (at least in our case).
Yeah, the security update in 1.6.7 has a significantly different implementation on Windows than on any other operating system. We had to implement AppContainers rather than just chrooting the rendering subprocess.
Unfortunately, it looks like the client logs you've provided here are at info-level only, so we may be missing some context. Here are the only relevant bits:
{"@level":"info","@message":"(runner) creating new runner (dry: false, once: false)","@module":"agent","@timestamp":"2024-02-28T04:59:42.152055-05:00"}
{"@level":"info","@message":"(runner) creating watcher","@module":"agent","@timestamp":"2024-02-28T04:59:42.153260-05:00"}
{"@level":"info","@message":"(runner) starting","@module":"agent","@timestamp":"2024-02-28T04:59:42.153828-05:00"}
{"@level":"error","@message":"exit status 0xc0000142","@module":"agent","@timestamp":"2024-02-28T04:59:42.204358-05:00"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-02-28T04:59:42.204358-05:00","alloc_id":"de978c15-f7fe-ec5a-c386-a205a79ec2d5","failed":true,"msg":"Template failed: error rendering \"(dynamic)\" =\u003e \"C:\\\\ProgramData\\\\nomad\\\\alloc\\\\de978c15-f7fe-ec5a-c386-a205a79ec2d5\\\\control-plane-logging-task\\\\local\\\\log_config\": template render subprocess failed: exit status 0xc0000142","task":"control-plane-logging-task","type":"Killing"}
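Pulling the relevant entries out of a long JSON-formatted client log can be scripted. This is a rough sketch, assuming one JSON object per line with the `@level` and `@message` keys seen above; the default level names and keyword are illustrative choices rather than anything prescribed by Nomad.

```python
import json

# Two of the log lines quoted above, used as sample input.
SAMPLE_LOG = """\
{"@level":"info","@message":"(runner) starting","@module":"agent","@timestamp":"2024-02-28T04:59:42.153828-05:00"}
{"@level":"error","@message":"exit status 0xc0000142","@module":"agent","@timestamp":"2024-02-28T04:59:42.204358-05:00"}
"""


def filter_entries(text, levels=("warn", "error"), keywords=("Template failed",)):
    """Yield parsed entries that match a level or mention a keyword."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        # Task events carry their text in "msg" rather than "@message".
        haystack = f'{entry.get("@message", "")} {entry.get("msg", "")}'
        if entry.get("@level") in levels or any(k in haystack for k in keywords):
            yield entry


if __name__ == "__main__":
    for entry in filter_entries(SAMPLE_LOG):
        print(entry["@timestamp"], entry["@level"], entry["@message"])
```

To filter a real log file, pass its contents to `filter_entries` in place of `SAMPLE_LOG`.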
According to the MSFT error reference documentation (PDF), the exit code we're getting here is STATUS_DLL_INIT_FAILED, which implies we weren't able to open the rendering process because we couldn't open a DLL somewhere. This doesn't make a whole lot of sense, as the rendering process doesn't load any external DLLs. So I'm still investigating how we could be hitting this error.
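The exit status is an NTSTATUS value, so it can be split into its bit fields to confirm it is an error-severity code. The sketch below follows the documented NTSTATUS layout; the lookup table deliberately contains only the one name mentioned in this thread, and the function names are made up for illustration.

```python
# NTSTATUS layout: bits 31-30 severity, bit 29 customer flag, bit 28 reserved,
# bits 27-16 facility, bits 15-0 code.
SEVERITY_NAMES = {0: "success", 1: "informational", 2: "warning", 3: "error"}

# Only the status named in this thread; this is not a complete table.
KNOWN_STATUS_NAMES = {0xC0000142: "STATUS_DLL_INIT_FAILED"}


def parse_exit_status(message):
    """Extract the hex value from a message like 'exit status 0xc0000142'."""
    return int(message.rsplit(" ", 1)[-1], 16)


def decode_ntstatus(status):
    """Split a 32-bit NTSTATUS value into its fields."""
    status &= 0xFFFFFFFF
    return {
        "value": f"0x{status:08x}",
        "severity": SEVERITY_NAMES[status >> 30],
        "customer_defined": bool((status >> 29) & 1),
        "facility": (status >> 16) & 0x0FFF,
        "code": status & 0xFFFF,
        "name": KNOWN_STATUS_NAMES.get(status, "unknown"),
    }


if __name__ == "__main__":
    print(decode_ntstatus(parse_exit_status("exit status 0xc0000142")))
```

For 0xc0000142 this gives error severity, facility 0, and code 0x142.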
It shouldn't make a difference, but just to help me eliminate possibilities, are you running Nomad as a Windows Service using the instructions in Register Nomad with Windows, or some other way?
trace_log_windows_client.json Please find the trace logs attached. Nomad is registered as a native Windows service.
Thanks @pavanrangain! Sorry to get back so slowly on this. I just wanted to pop in and say we haven't forgotten you but I've been swamped with a couple of other items. We're going to hand this issue off to @angrycub who helped me work on the security fix that's at the heart of this, and he'll start looking at this once he's wrapped up his current task. Thanks for your patience!
Any progress on this? We're still stuck on 1.7.3 in order to avoid this issue.
Hi @DTTerastar! We have a solid idea of the problem, which is a difference in ambient credentials between running Nomad as a Windows service vs otherwise (which is what all our tests did! :facepalm:). It's taking longer than we'd like to figure out the solution however. We'll update this issue when we have more information.