Allocation resources usage showing zero for rootless raw_exec driver in v1.7.6
Nomad version
Nomad v1.6.9 (expected behavior) compared with Nomad v1.7.6 (incorrect behavior)
Operating system and Environment details
cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
Issue
In v1.7.6, allocation resource usage is reported as zero for the raw_exec driver when Nomad runs in a rootless environment (NOT as root). By contrast, v1.6.9 displays it correctly. FYI, I don't know how Nomad behaves when running as root or with other drivers (Java, Docker, Exec, ...), since we can't run as root.
Reproduction steps
I ran the steps below identically on Nomad v1.6.9 and v1.7.6; v1.6.9 reported the expected values while v1.7.6 reported zero.
- Start a dev Nomad server and client:
nomad agent -dev -log-include-location -log-level=debug -bind 0.0.0.0
- Run a very simple Python server (the python-service.hcl job file is shown below):
nomad job run python-service.hcl
- Once the job is running, display the allocation info:
nomad alloc status <correspondingAllocId>
FYI, this misbehavior is not limited to the nomad alloc status command; it happens across the board, including the usage graph on the UI dashboard and the /v1/metrics API output.
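For completeness, the same zero values can be read straight from the client's alloc stats HTTP API (GET /v1/client/allocation/<allocId>/stats). Below is a minimal sketch that extracts the relevant fields from a trimmed, hypothetical sample of such a response (the field names match the stats API; the values shown are illustrative). Against a live dev agent you would pipe `curl -s localhost:4646/v1/client/allocation/<allocId>/stats` into the same one-liner.

```shell
# Hypothetical trimmed sample of an alloc stats response; on a live agent,
# replace the echo with: curl -s localhost:4646/v1/client/allocation/<allocId>/stats
echo '{"ResourceUsage":{"CpuStats":{"TotalTicks":0.0},"MemoryStats":{"RSS":0}}}' |
  python3 -c 'import json,sys
u = json.load(sys.stdin)["ResourceUsage"]
print("cpu_ticks=%s rss=%s" % (u["CpuStats"]["TotalTicks"], u["MemoryStats"]["RSS"]))'
```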
Expected Result
The CPU and memory usage should be displayed properly (non-zero), as in Nomad v1.6.9.
nomad/nomad1.6.9 alloc status 94e3e990
ID = 94e3e990-e0ff-efb1-0bd2-aca285bfd853
Eval ID = 901d4f9c
Name = python-service.python-service-nomad[0]
Node ID = d25b0ede
Node Name = txvslgcath004
Job ID = python-service
Job Version = 0
Client Status = running
Client Description = Tasks are running
Desired Status = run
Desired Description = <none>
Created = 37s ago
Modified = 27s ago
Deployment ID = 01b02560
Deployment Health = healthy
Allocation Addresses:
Label Dynamic Address
*webapp0 yes 127.0.0.1:25915
Task "web-app-nomad" is "running"
Task Resources:
CPU Memory Disk Addresses
29/100 MHz 47 MiB/300 MiB 300 MiB
Task Events:
Started At = 2024-04-04T10:05:04Z
Finished At = N/A
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2024-04-04T11:05:04+01:00 Started Task started by client
2024-04-04T11:05:04+01:00 Task Setup Building Task Directory
2024-04-04T11:05:04+01:00 Received Task received by client
Actual Result
As shown in the output below, Nomad v1.7.6 reports 0 for both CPU and memory usage.
nomad/nomad1.7.6 alloc status b427e125-2117-80a4-f02d-f0c730eb7aaa
ID = b427e125-2117-80a4-f02d-f0c730eb7aaa
Eval ID = c9dd3ba9
Name = python-service.python-service-nomad[0]
Node ID = e5be6681
Node Name = txvslgcath004
Job ID = python-service
Job Version = 0
Client Status = running
Client Description = Tasks are running
Desired Status = run
Desired Description = <none>
Created = 5m37s ago
Modified = 5m27s ago
Deployment ID = 48763ca7
Deployment Health = healthy
Allocation Addresses:
Label Dynamic Address
*webapp0 yes 127.0.0.1:25201
Task "web-app-nomad" is "running"
Task Resources:
CPU Memory Disk Addresses
0/100 MHz 0 B/300 MiB 300 MiB
Task Events:
Started At = 2024-04-04T09:56:27Z
Finished At = N/A
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2024-04-04T10:56:27+01:00 Started Task started by client
2024-04-04T10:56:27+01:00 Task Setup Building Task Directory
2024-04-04T10:56:27+01:00 Received Task received by client
Job file (if appropriate)
job "python-service" {
  # Specifies the datacenter where this job should be run.
  # This can be omitted and it will default to ["*"].
  datacenters = ["*"]

  group "python-service-nomad" {
    network {
      port "webapp0" {}
    }

    task "web-app-nomad" {
      driver = "raw_exec"

      config {
        command = "python"
        args    = ["-m", "http.server", "${NOMAD_PORT_webapp0}"]
      }
    }
  }
}
Nomad Server logs (if appropriate)
Below are the startup logs from Nomad v1.6.9 and Nomad v1.7.6. I noticed that cgroup is not detected in v1.7.6; I don't think it's related to the CPU/memory usage info, but it seems worth mentioning.
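As a side note on the cgroup difference: the v1.6.9 log below initializes cgroups via cgutil/cpuset_manager_v1.go, while the v1.7.6 log goes through the newer cgroupslib (the "client.proclib.cg2" line). For anyone trying to reproduce this, here is a hedged sketch of how to check which cgroup hierarchy the host actually exposes (assumes a Linux host with /sys/fs/cgroup mounted; CentOS 7 defaults to cgroup v1):

```shell
# Report which cgroup hierarchy is mounted, based on the filesystem type
# of /sys/fs/cgroup ("cgroup2fs" means the unified v2 hierarchy).
fstype=$(stat -fc %T /sys/fs/cgroup)
if [ "$fstype" = "cgroup2fs" ]; then
  echo "cgroup v2 (unified)"
else
  echo "cgroup v1 or hybrid (fstype: $fstype)"
fi
```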
Startup log of Nomad v1.6.9
nomad/nomad1.6.9 agent -dev -log-include-location -log-level=debug -bind 0.0.0.0
==> No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:
Advertise Addrs: HTTP: 10.57.20.156:4646; RPC: 10.57.20.156:4647; Serf: 10.57.20.156:4648
Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Client: true
Log Level: debug
Node Id: 41b8f952-3c9e-261e-e6c3-41bb032a0ea0
Region: global (DC: dc1)
Server: true
Version: 1.6.9
==> Nomad agent started! Log data will stream in below:
2024-04-04T11:03:56.642+0100 [INFO] [email protected]/api.go:591: nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:8180a835-a887-8df5-ce2f-74638feed68f Address:10.57.20.156:4647}]"
2024-04-04T11:03:56.642+0100 [INFO] [email protected]/raft.go:159: nomad.raft: entering follower state: follower="Node at 10.57.20.156:4647 [Follower]" leader-address= leader-id=
2024-04-04T11:03:56.642+0100 [INFO] [email protected]/stdlog.go:60: nomad: serf: EventMemberJoin: txvslgcath004.global 10.57.20.156
2024-04-04T11:03:56.642+0100 [INFO] nomad/server.go:1757: nomad: starting scheduling worker(s): num_workers=8 schedulers=["service", "batch", "system", "sysbatch", "_core"]
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=28f9afe0-bd29-0f0e-ec62-b2c8dab6242a index=1 of=8
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=39ae8096-4293-76cf-617c-10560ad95c9b index=2 of=8
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=23e63116-a6f7-b30c-b6ec-c2ed7511a294 index=3 of=8
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=3419dbbc-790a-4b45-680b-6d136557a8bd index=4 of=8
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=17157abf-1a73-7150-2297-66206481f425 index=5 of=8
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=23e63116-a6f7-b30c-b6ec-c2ed7511a294
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=17157abf-1a73-7150-2297-66206481f425
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=39ae8096-4293-76cf-617c-10560ad95c9b
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=8a92f25e-dd71-a624-f1da-e189b997463b index=6 of=8
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=28f9afe0-bd29-0f0e-ec62-b2c8dab6242a
2024-04-04T11:03:56.643+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=514b0144-8ff1-abcc-2154-de1b38bd8567 index=7 of=8
2024-04-04T11:03:56.643+0100 [DEBUG] nomad/server.go:1764: nomad: started scheduling worker: id=8b983810-6d40-0e03-89b7-1e5b0a9e6e6a index=8 of=8
2024-04-04T11:03:56.643+0100 [INFO] nomad/server.go:1769: nomad: started scheduling worker(s): num_workers=8 schedulers=["service", "batch", "system", "sysbatch", "_core"]
2024-04-04T11:03:56.643+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=514b0144-8ff1-abcc-2154-de1b38bd8567
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=3419dbbc-790a-4b45-680b-6d136557a8bd
2024-04-04T11:03:56.643+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=8b983810-6d40-0e03-89b7-1e5b0a9e6e6a
2024-04-04T11:03:56.642+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=8a92f25e-dd71-a624-f1da-e189b997463b
2024-04-04T11:03:56.643+0100 [INFO] nomad/serf.go:62: nomad: adding server: server="txvslgcath004.global (Addr: 10.57.20.156:4647) (DC: dc1)"
2024-04-04T11:03:56.643+0100 [DEBUG] nomad/encrypter.go:446: nomad.keyring.replicator: starting encryption key replication
2024-04-04T11:03:56.645+0100 [DEBUG] docker/driver.go:1885: agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=""
2024-04-04T11:03:56.645+0100 [INFO] agent/plugins.go:41: agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2024-04-04T11:03:56.645+0100 [INFO] agent/plugins.go:41: agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2024-04-04T11:03:56.645+0100 [INFO] agent/plugins.go:41: agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2024-04-04T11:03:56.645+0100 [INFO] agent/plugins.go:41: agent: detected plugin: name=java type=driver plugin_version=0.1.0
2024-04-04T11:03:56.645+0100 [INFO] agent/plugins.go:41: agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2024-04-04T11:03:56.645+0100 [WARN] cgutil/cpuset_manager_v1.go:48: client.cpuset.v1: failed to ensure cgroup parent exists; disable cpuset management: error="mkdir /sys/fs/cgroup/cpuset/nomad: permission denied"
2024-04-04T11:03:56.645+0100 [INFO] client/client.go:623: client: using state directory: state_dir=/tmp/NomadClient3429186125
2024-04-04T11:03:56.645+0100 [INFO] client/client.go:670: client: using alloc directory: alloc_dir=/tmp/NomadClient1804281296
2024-04-04T11:03:56.645+0100 [INFO] client/client.go:678: client: using dynamic ports: min=20000 max=32000 reserved=""
2024-04-04T11:03:56.654+0100 [DEBUG] client/fingerprint_manager.go:83: client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_azure", "env_digitalocean", "env_aws", "env_gce"]
2024-04-04T11:03:56.667+0100 [INFO] fingerprint/cgroup_linux.go:43: client.fingerprint_mgr.cgroup: cgroups are available
2024-04-04T11:03:56.667+0100 [DEBUG] fingerprint/cni.go:31: client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
2024-04-04T11:03:56.667+0100 [DEBUG] client/fingerprint_manager.go:165: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=cgroup initial_period=15s
2024-04-04T11:03:56.667+0100 [DEBUG] client/fingerprint_manager.go:165: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s
2024-04-04T11:03:56.667+0100 [DEBUG] fingerprint/cpu.go:74: client.fingerprint_mgr.cpu: detected CPU model: name="Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz"
2024-04-04T11:03:56.667+0100 [DEBUG] fingerprint/cpu.go:92: client.fingerprint_mgr.cpu: detected CPU frequency: mhz=2992
2024-04-04T11:03:56.667+0100 [DEBUG] fingerprint/cpu.go:110: client.fingerprint_mgr.cpu: detected CPU core count: EXTRA_VALUE_AT_END=8
2024-04-04T11:03:56.673+0100 [WARN] fingerprint/landlock.go:34: client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="function not implemented"
2024-04-04T11:03:56.674+0100 [WARN] fingerprint/network_linux.go:65: client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=lo
2024-04-04T11:03:56.674+0100 [DEBUG] fingerprint/network_linux.go:22: client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
2024-04-04T11:03:56.674+0100 [DEBUG] fingerprint/network.go:98: client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: interface=lo mbits=1000
2024-04-04T11:03:56.674+0100 [DEBUG] fingerprint/network.go:118: client.fingerprint_mgr.network: detected interface IP: interface=lo IP=127.0.0.1
2024-04-04T11:03:56.674+0100 [DEBUG] fingerprint/network.go:118: client.fingerprint_mgr.network: detected interface IP: interface=lo IP=::1
2024-04-04T11:03:56.675+0100 [WARN] fingerprint/network_linux.go:65: client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=lo
2024-04-04T11:03:56.675+0100 [DEBUG] fingerprint/network_linux.go:22: client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
2024-04-04T11:03:56.675+0100 [DEBUG] fingerprint/network.go:147: client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=lo mbits=1000
2024-04-04T11:03:56.676+0100 [WARN] fingerprint/plugins_cni.go:53: client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
2024-04-04T11:03:56.677+0100 [DEBUG] client/fingerprint_manager.go:165: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault initial_period=15s
2024-04-04T11:03:56.681+0100 [DEBUG] fingerprint/env_gce.go:110: client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type error="Get \"http://169.254.169.254/computeMetadata/v1/instance/machine-type\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2024-04-04T11:03:56.681+0100 [DEBUG] fingerprint/env_gce.go:285: client.fingerprint_mgr.env_gce: error querying GCE Metadata URL, skipping
2024-04-04T11:03:56.682+0100 [DEBUG] fingerprint/env_azure.go:95: client.fingerprint_mgr.env_azure: could not read value for attribute: attribute=compute/azEnvironment error="Get \"http://169.254.169.254/metadata/instance/compute/azEnvironment?api-version=2019-06-04&format=text\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2024-04-04T11:03:56.683+0100 [DEBUG] fingerprint/env_digitalocean.go:84: client.fingerprint_mgr.env_digitalocean: failed to request metadata: attribute=region error="Get \"http://169.254.169.254/metadata/v1/region\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2024-04-04T11:03:56.683+0100 [DEBUG] client/fingerprint_manager.go:158: client.fingerprint_mgr: detected fingerprints: node_attrs=["arch", "bridge", "cgroup", "cpu", "host", "network", "nomad", "signal", "storage"]
2024-04-04T11:03:56.683+0100 [INFO] pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=csi
2024-04-04T11:03:56.683+0100 [INFO] pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=driver
2024-04-04T11:03:56.683+0100 [INFO] pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=device
2024-04-04T11:03:56.683+0100 [DEBUG] devicemanager/manager.go:145: client.device_mgr: exiting since there are no device plugins
2024-04-04T11:03:56.683+0100 [DEBUG] pluginmanager/group.go:70: client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
2024-04-04T11:03:56.683+0100 [DEBUG] pluginmanager/group.go:70: client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
2024-04-04T11:03:56.683+0100 [DEBUG] pluginmanager/group.go:74: client.plugin: finished plugin manager initial fingerprint: plugin-type=device
2024-04-04T11:03:56.683+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=raw_exec health=healthy description=Healthy
2024-04-04T11:03:56.683+0100 [DEBUG] docker/driver.go:1885: client.driver_mgr.docker: using client connection initialized from environment: driver=docker
2024-04-04T11:03:56.683+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=exec health=undetected description="Driver must run as root"
2024-04-04T11:03:56.683+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=java health=undetected description="Driver must run as root"
2024-04-04T11:03:56.684+0100 [ERROR] docker/driver.go:830: client.driver_mgr.docker: failed to list pause containers for recovery: driver=docker error="Get \"http://unix.sock/containers/json?filters=%7B%22label%22%3A%5B%22com.hashicorp.nomad.alloc_id%22%5D%7D\": dial unix /var/run/docker.sock: connect: no such file or directory"
2024-04-04T11:03:56.684+0100 [DEBUG] docker/fingerprint.go:116: client.driver_mgr.docker: could not connect to docker daemon: driver=docker endpoint=unix:///var/run/docker.sock error="Get \"http://unix.sock/version\": dial unix /var/run/docker.sock: connect: no such file or directory"
2024-04-04T11:03:56.684+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Failed to connect to docker daemon"
2024-04-04T11:03:56.684+0100 [DEBUG] servers/manager.go:208: client.server_mgr: new server list: new_servers=[0.0.0.0:4647, 10.57.20.156:4647] old_servers=[]
2024-04-04T11:03:56.687+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=""
2024-04-04T11:03:56.687+0100 [DEBUG] drivermanager/manager.go:281: client.driver_mgr: detected drivers: drivers="map[healthy:[raw_exec] undetected:[exec java docker qemu]]"
2024-04-04T11:03:56.687+0100 [DEBUG] pluginmanager/group.go:74: client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
2024-04-04T11:03:56.687+0100 [INFO] client/client.go:588: client: started client: node_id=d25b0ede-b22e-7ffc-d641-8393267a1567
2024-04-04T11:03:56.687+0100 [DEBUG] agent/http.go:510: http: UI is enabled
2024-04-04T11:03:56.687+0100 [DEBUG] agent/http.go:510: http: UI is enabled
2024-04-04T11:03:57.914+0100 [WARN] [email protected]/raft.go:241: nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2024-04-04T11:03:57.914+0100 [INFO] [email protected]/raft.go:285: nomad.raft: entering candidate state: node="Node at 10.57.20.156:4647 [Candidate]" term=2
2024-04-04T11:03:57.914+0100 [DEBUG] [email protected]/raft.go:1869: nomad.raft: voting for self: term=2 id=8180a835-a887-8df5-ce2f-74638feed68f
2024-04-04T11:03:57.914+0100 [DEBUG] [email protected]/raft.go:304: nomad.raft: calculated votes needed: needed=1 term=2
2024-04-04T11:03:57.914+0100 [DEBUG] [email protected]/raft.go:327: nomad.raft: vote granted: from=8180a835-a887-8df5-ce2f-74638feed68f term=2 tally=1
2024-04-04T11:03:57.914+0100 [INFO] [email protected]/raft.go:332: nomad.raft: election won: term=2 tally=1
2024-04-04T11:03:57.914+0100 [INFO] [email protected]/raft.go:421: nomad.raft: entering leader state: leader="Node at 10.57.20.156:4647 [Leader]"
2024-04-04T11:03:57.914+0100 [INFO] nomad/leader.go:104: nomad: cluster leadership acquired
2024-04-04T11:03:57.916+0100 [DEBUG] [email protected]/run.go:103: nomad.autopilot: autopilot is now running
2024-04-04T11:03:57.916+0100 [DEBUG] [email protected]/run.go:175: nomad.autopilot: state update routine is now running
2024-04-04T11:03:57.916+0100 [INFO] nomad/leader.go:2732: nomad.core: established cluster id: cluster_id=3b95c2a8-7d89-6a64-db4e-02511408bf9e create_time=1712225037916341223
2024-04-04T11:03:57.916+0100 [INFO] nomad/leader.go:2784: nomad: eval broker status modified: paused=false
2024-04-04T11:03:57.916+0100 [INFO] nomad/leader.go:2791: nomad: blocked evals status modified: paused=false
2024-04-04T11:03:57.918+0100 [INFO] nomad/leader.go:2717: nomad.keyring: initialized keyring: id=ab5a3307-d5ea-2798-4a7f-8d09e4b1ac8c
2024-04-04T11:03:58.144+0100 [DEBUG] servers/manager.go:208: client.server_mgr: new server list: new_servers=[10.57.20.156:4647] old_servers=[0.0.0.0:4647, 10.57.20.156:4647]
2024-04-04T11:03:58.144+0100 [INFO] client/client.go:2030: client: node registration complete
2024-04-04T11:03:58.144+0100 [DEBUG] client/client.go:2451: client: updated allocations: index=1 total=0 pulled=0 filtered=0
2024-04-04T11:03:58.144+0100 [DEBUG] client/client.go:2524: client: allocation updates: added=0 removed=0 updated=0 ignored=0
2024-04-04T11:03:58.144+0100 [DEBUG] client/client.go:2568: client: allocation updates applied: added=0 removed=0 updated=0 ignored=0 errors=0
2024-04-04T11:03:58.145+0100 [DEBUG] client/client.go:2077: client: state updated: node_status=ready
2024-04-04T11:03:59.145+0100 [DEBUG] client/client.go:2497: client: state changed, updating node and re-registering
2024-04-04T11:03:59.146+0100 [INFO] client/client.go:2030: client: node registration complete
==> Failed to check for updates: Get "https://checkpoint-api.hashicorp.com/v1/check/nomad?arch=amd64&os=linux&signature=&version=1.6.9": Proxy Authentication Required
Startup log of Nomad v1.7.6
nomad/nomad1.7.6 agent -dev -log-include-location -log-level=debug -bind 0.0.0.0
==> No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:
Advertise Addrs: HTTP: 10.57.20.156:4646; RPC: 10.57.20.156:4647; Serf: 10.57.20.156:4648
Bind Addrs: HTTP: [0.0.0.0:4646]; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Client: true
Log Level: debug
Node Id: 4bee006f-9a4c-dc14-c7de-e89afdde439b
Region: global (DC: dc1)
Server: true
Version: 1.7.6
==> Nomad agent started! Log data will stream in below:
2024-04-04T10:54:26.543+0100 [DEBUG] nomad/server.go:440: nomad: issuer not set; OIDC Discovery endpoint for workload identities disabled
2024-04-04T10:54:26.544+0100 [INFO] [email protected]/api.go:591: nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:a3882dfa-e3b0-d1d5-c45f-8cbc011d941b Address:10.57.20.156:4647}]"
2024-04-04T10:54:26.545+0100 [INFO] [email protected]/raft.go:159: nomad.raft: entering follower state: follower="Node at 10.57.20.156:4647 [Follower]" leader-address= leader-id=
2024-04-04T10:54:26.545+0100 [INFO] [email protected]/stdlog.go:60: nomad: serf: EventMemberJoin: txvslgcath004.global 10.57.20.156
2024-04-04T10:54:26.545+0100 [INFO] nomad/server.go:1818: nomad: starting scheduling worker(s): num_workers=8 schedulers=["service", "batch", "system", "sysbatch", "_core"]
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=32044e35-0ef3-c4ad-0ae0-f2e09f176c04 index=1 of=8
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=c42ddd46-caa9-1af6-7ce8-2ea34b3a82c1 index=2 of=8
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=b962e4f2-6e81-bba5-35dc-022011ce839e index=3 of=8
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=2eed5234-f8e6-eb14-74cd-bbc632b1d04a index=4 of=8
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=4c94eb32-03a1-043c-39b2-2dcabe99ef5f index=5 of=8
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=fc2af168-830c-c3de-471c-97c9bdfa4ead index=6 of=8
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=9997d867-45f6-3527-bddc-220cb29a2da4 index=7 of=8
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=32044e35-0ef3-c4ad-0ae0-f2e09f176c04
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=b962e4f2-6e81-bba5-35dc-022011ce839e
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/server.go:1825: nomad: started scheduling worker: id=869b07c8-c751-2524-6ee5-fb2d10f2674d index=8 of=8
2024-04-04T10:54:26.545+0100 [INFO] nomad/server.go:1830: nomad: started scheduling worker(s): num_workers=8 schedulers=["service", "batch", "system", "sysbatch", "_core"]
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=869b07c8-c751-2524-6ee5-fb2d10f2674d
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=fc2af168-830c-c3de-471c-97c9bdfa4ead
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=2eed5234-f8e6-eb14-74cd-bbc632b1d04a
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=c42ddd46-caa9-1af6-7ce8-2ea34b3a82c1
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=4c94eb32-03a1-043c-39b2-2dcabe99ef5f
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/encrypter.go:555: nomad.keyring.replicator: starting encryption key replication
2024-04-04T10:54:26.545+0100 [INFO] nomad/serf.go:62: nomad: adding server: server="txvslgcath004.global (Addr: 10.57.20.156:4647) (DC: dc1)"
2024-04-04T10:54:26.545+0100 [DEBUG] nomad/worker.go:403: worker: running: worker_id=9997d867-45f6-3527-bddc-220cb29a2da4
2024-04-04T10:54:26.546+0100 [DEBUG] docker/driver.go:1870: agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=""
2024-04-04T10:54:26.547+0100 [INFO] agent/plugins.go:41: agent: detected plugin: name=java type=driver plugin_version=0.1.0
2024-04-04T10:54:26.547+0100 [INFO] agent/plugins.go:41: agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2024-04-04T10:54:26.547+0100 [INFO] agent/plugins.go:41: agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2024-04-04T10:54:26.547+0100 [INFO] agent/plugins.go:41: agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2024-04-04T10:54:26.547+0100 [INFO] agent/plugins.go:41: agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2024-04-04T10:54:26.547+0100 [INFO] client/client.go:665: client: using state directory: state_dir=/tmp/NomadClient3325952921
2024-04-04T10:54:26.547+0100 [INFO] client/client.go:712: client: using alloc directory: alloc_dir=/tmp/NomadClient1555522694
2024-04-04T10:54:26.547+0100 [INFO] client/client.go:720: client: using dynamic ports: min=20000 max=32000 reserved=""
2024-04-04T10:54:26.553+0100 [DEBUG] client/fingerprint_manager.go:88: client.fingerprint_mgr: built-in fingerprints: fingerprinters=["arch", "bridge", "cgroup", "cni", "consul", "cpu", "host", "landlock", "memory", "network", "nomad", "plugins_cni", "signal", "storage", "vault", "env_gce", "env_azure", "env_digitalocean", "env_aws"]
2024-04-04T10:54:26.565+0100 [DEBUG] fingerprint/cni.go:31: client.fingerprint_mgr: CNI config dir is not set or does not exist, skipping: cni_config_dir=/opt/cni/config
2024-04-04T10:54:26.566+0100 [DEBUG] client/fingerprint_manager.go:170: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=consul initial_period=15s
2024-04-04T10:54:26.567+0100 [DEBUG] fingerprint/cpu.go:104: client.fingerprint_mgr.cpu: detected CPU model: name="Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz"
2024-04-04T10:54:26.567+0100 [DEBUG] fingerprint/cpu.go:122: client.fingerprint_mgr.cpu: detected CPU frequency: mhz=2992
2024-04-04T10:54:26.567+0100 [DEBUG] fingerprint/cpu.go:148: client.fingerprint_mgr.cpu: detected CPU core count: cores=8
2024-04-04T10:54:26.572+0100 [WARN] fingerprint/landlock.go:34: client.fingerprint_mgr.landlock: failed to fingerprint kernel landlock feature: error="function not implemented"
2024-04-04T10:54:26.574+0100 [WARN] fingerprint/network_linux.go:65: client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=lo
2024-04-04T10:54:26.574+0100 [DEBUG] fingerprint/network_linux.go:22: client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
2024-04-04T10:54:26.574+0100 [DEBUG] fingerprint/network.go:98: client.fingerprint_mgr.network: link speed could not be detected and no speed specified by user, falling back to default speed: interface=lo mbits=1000
2024-04-04T10:54:26.574+0100 [DEBUG] fingerprint/network.go:113: client.fingerprint_mgr.network: detected interface IP: interface=lo IP=127.0.0.1
2024-04-04T10:54:26.574+0100 [DEBUG] fingerprint/network.go:113: client.fingerprint_mgr.network: detected interface IP: interface=lo IP=::1
2024-04-04T10:54:26.575+0100 [WARN] fingerprint/network_linux.go:65: client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=lo
2024-04-04T10:54:26.575+0100 [DEBUG] fingerprint/network_linux.go:22: client.fingerprint_mgr.network: unable to read link speed: path=/sys/class/net/lo/speed device=lo
2024-04-04T10:54:26.575+0100 [DEBUG] fingerprint/network.go:142: client.fingerprint_mgr.network: link speed could not be detected, falling back to default speed: interface=lo mbits=1000
2024-04-04T10:54:26.576+0100 [WARN] fingerprint/plugins_cni.go:53: client.fingerprint_mgr.cni_plugins: failed to read CNI plugins directory: cni_path=/opt/cni/bin error="open /opt/cni/bin: no such file or directory"
2024-04-04T10:54:26.577+0100 [DEBUG] client/fingerprint_manager.go:170: client.fingerprint_mgr: fingerprinting periodically: fingerprinter=vault initial_period=15s
2024-04-04T10:54:26.581+0100 [DEBUG] fingerprint/env_gce.go:110: client.fingerprint_mgr.env_gce: could not read value for attribute: attribute=machine-type error="Get \"http://169.254.169.254/computeMetadata/v1/instance/machine-type\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2024-04-04T10:54:26.581+0100 [DEBUG] fingerprint/env_gce.go:285: client.fingerprint_mgr.env_gce: error querying GCE Metadata URL, skipping
2024-04-04T10:54:26.582+0100 [DEBUG] fingerprint/env_azure.go:94: client.fingerprint_mgr.env_azure: could not read value for attribute: attribute=compute/azEnvironment error="Get \"http://169.254.169.254/metadata/instance/compute/azEnvironment?api-version=2019-06-04&format=text\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2024-04-04T10:54:26.583+0100 [DEBUG] fingerprint/env_digitalocean.go:84: client.fingerprint_mgr.env_digitalocean: failed to request metadata: attribute=region error="Get \"http://169.254.169.254/metadata/v1/region\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2024-04-04T10:54:26.583+0100 [DEBUG] client/fingerprint_manager.go:163: client.fingerprint_mgr: detected fingerprints: node_attrs=["arch", "bridge", "cpu", "host", "network", "nomad", "signal", "storage"]
2024-04-04T10:54:26.583+0100 [INFO] cgroupslib/init.go:28: client.proclib.cg2: initializing nomad cgroups: cores=0-7
2024-04-04T10:54:26.583+0100 [INFO] pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=csi
2024-04-04T10:54:26.583+0100 [INFO] pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=driver
2024-04-04T10:54:26.583+0100 [INFO] pluginmanager/group.go:46: client.plugin: starting plugin manager: plugin-type=device
2024-04-04T10:54:26.583+0100 [DEBUG] devicemanager/manager.go:145: client.device_mgr: exiting since there are no device plugins
2024-04-04T10:54:26.583+0100 [DEBUG] pluginmanager/group.go:70: client.plugin: waiting on plugin manager initial fingerprint: plugin-type=driver
2024-04-04T10:54:26.583+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=exec health=undetected description="Driver must run as root"
2024-04-04T10:54:26.583+0100 [DEBUG] docker/driver.go:1870: client.driver_mgr.docker: using client connection initialized from environment: driver=docker
2024-04-04T10:54:26.583+0100 [DEBUG] pluginmanager/group.go:70: client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
2024-04-04T10:54:26.584+0100 [DEBUG] pluginmanager/group.go:74: client.plugin: finished plugin manager initial fingerprint: plugin-type=device
2024-04-04T10:54:26.584+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=docker health=undetected description="Driver must run as root"
2024-04-04T10:54:26.583+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=java health=undetected description="Driver must run as root"
2024-04-04T10:54:26.583+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=raw_exec health=healthy description=Healthy
2024-04-04T10:54:26.583+0100 [DEBUG] drivermanager/instance.go:386: client.driver_mgr: initial driver fingerprint: driver=qemu health=undetected description=""
2024-04-04T10:54:26.584+0100 [DEBUG] drivermanager/manager.go:281: client.driver_mgr: detected drivers: drivers="map[healthy:[raw_exec] undetected:[exec docker java qemu]]"
2024-04-04T10:54:26.584+0100 [DEBUG] pluginmanager/group.go:74: client.plugin: finished plugin manager initial fingerprint: plugin-type=driver
2024-04-04T10:54:26.584+0100 [ERROR] docker/driver.go:823: client.driver_mgr.docker: failed to list pause containers for recovery: driver=docker error="Get \"http://unix.sock/containers/json?filters=%7B%22label%22%3A%5B%22com.hashicorp.nomad.alloc_id%22%5D%7D\": dial unix /var/run/docker.sock: connect: no such file or directory"
2024-04-04T10:54:26.585+0100 [DEBUG] servers/manager.go:208: client.server_mgr: new server list: new_servers=[0.0.0.0:4647, 10.57.20.156:4647] old_servers=[]
2024-04-04T10:54:26.585+0100 [INFO] client/client.go:630: client: started client: node_id=e5be6681-6198-1191-5a05-0d1226c37f81
2024-04-04T10:54:26.585+0100 [DEBUG] agent/http.go:515: http: UI is enabled
2024-04-04T10:54:26.586+0100 [DEBUG] agent/http.go:515: http: UI is enabled
2024-04-04T10:54:28.174+0100 [WARN] [email protected]/raft.go:241: nomad.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2024-04-04T10:54:28.174+0100 [INFO] [email protected]/raft.go:285: nomad.raft: entering candidate state: node="Node at 10.57.20.156:4647 [Candidate]" term=2
2024-04-04T10:54:28.174+0100 [DEBUG] [email protected]/raft.go:1869: nomad.raft: voting for self: term=2 id=a3882dfa-e3b0-d1d5-c45f-8cbc011d941b
2024-04-04T10:54:28.174+0100 [DEBUG] [email protected]/raft.go:304: nomad.raft: calculated votes needed: needed=1 term=2
2024-04-04T10:54:28.174+0100 [DEBUG] [email protected]/raft.go:327: nomad.raft: vote granted: from=a3882dfa-e3b0-d1d5-c45f-8cbc011d941b term=2 tally=1
2024-04-04T10:54:28.174+0100 [INFO] [email protected]/raft.go:332: nomad.raft: election won: term=2 tally=1
2024-04-04T10:54:28.174+0100 [INFO] [email protected]/raft.go:421: nomad.raft: entering leader state: leader="Node at 10.57.20.156:4647 [Leader]"
2024-04-04T10:54:28.174+0100 [INFO] nomad/leader.go:109: nomad: cluster leadership acquired
2024-04-04T10:54:28.176+0100 [DEBUG] [email protected]/run.go:103: nomad.autopilot: autopilot is now running
2024-04-04T10:54:28.176+0100 [DEBUG] [email protected]/run.go:175: nomad.autopilot: state update routine is now running
2024-04-04T10:54:28.176+0100 [INFO] nomad/leader.go:2798: nomad.core: established cluster id: cluster_id=3342784b-d650-3b98-4c50-cdd043d4fc54 create_time=1712224468176447102
2024-04-04T10:54:28.176+0100 [INFO] nomad/leader.go:2850: nomad: eval broker status modified: paused=false
2024-04-04T10:54:28.176+0100 [INFO] nomad/leader.go:2857: nomad: blocked evals status modified: paused=false
2024-04-04T10:54:28.345+0100 [DEBUG] servers/manager.go:208: client.server_mgr: new server list: new_servers=[10.57.20.156:4647] old_servers=[0.0.0.0:4647, 10.57.20.156:4647]
2024-04-04T10:54:28.345+0100 [INFO] client/client.go:2041: client: node registration complete
2024-04-04T10:54:28.345+0100 [DEBUG] client/client.go:2088: client: state updated: node_status=ready
2024-04-04T10:54:28.345+0100 [DEBUG] client/client.go:2462: client: updated allocations: index=1 total=0 pulled=0 filtered=0
2024-04-04T10:54:28.346+0100 [DEBUG] client/client.go:2535: client: allocation updates: added=0 removed=0 updated=0 ignored=0
2024-04-04T10:54:28.346+0100 [DEBUG] client/client.go:2579: client: allocation updates applied: added=0 removed=0 updated=0 ignored=0 errors=0
2024-04-04T10:54:28.652+0100 [INFO] nomad/leader.go:2783: nomad.keyring: initialized keyring: id=be29d285-56a1-0e0b-92fa-12e930e5bddb
2024-04-04T10:54:29.345+0100 [DEBUG] client/client.go:2508: client: state changed, updating node and re-registering
2024-04-04T10:54:29.346+0100 [INFO] client/client.go:2041: client: node registration complete
==> Failed to check for updates: Get "https://checkpoint-api.hashicorp.com/v1/check/nomad?arch=amd64&os=linux&signature=&version=1.7.6": Proxy Authentication Required
Nomad Client logs (if appropriate)
FYI, this post seems to report exactly the same issue as mine: https://discuss.hashicorp.com/t/nomad-1-7-3-upgraded-from-1-0-4-cpu-and-memory-allocation-showing-as-zero/62062
Hi @simon1990zcs and thanks for raising this issue. The Nomad client process needs to run as root, so this could be causing the issue you have seen. Recent cgroup changes and other isolation changes to support features like NUMA and cgroup v2 are likely involved here. This issue has some useful additional information.
Allocation resource usage is collected by the Nomad clients, so the logs from those agents might hold the information and errors needed to diagnose this problem.
Hi, @jrasell , thanks for looking into it.
- In terms of client-side logs, surprisingly there is not much when I run
nomad alloc status <allocId>
under Nomad v1.7.6 (same output for v1.6.9):
2024-04-05T10:30:11.944+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/allocations?prefix=8ccbe83b duration="300.194µs"
2024-04-05T10:30:11.947+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/allocation/8ccbe83b-fc7a-4001-4943-febed50fbd6d?namespace=default duration="551.809µs"
2024-04-05T10:30:11.952+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/client/allocation/8ccbe83b-fc7a-4001-4943-febed50fbd6d/checks duration="436.809µs"
2024-04-05T10:30:11.954+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/client/allocation/8ccbe83b-fc7a-4001-4943-febed50fbd6d/stats duration="393.442µs"
- In terms of running the Nomad client as a non-root user, I am aware of the drawbacks, and aware that we don't have root privileges for tools like cgroups for resource isolation. That's all fine with me, since I can't do that myself as a non-root user, and I don't expect Nomad to be able to do it for me. I very much appreciate that Nomad provides the raw_exec driver for non-root users, so that Nomad can act on my behalf to run/kill jobs, like the python-service in this demo.
- However, querying allocation resource usage doesn't require root access at all, in my understanding. That's why I consider this a bug.
I checked the Nomad codebase very briefly; I believe Nomad queries allocation/process usage through here. It basically uses the methods process.MemoryInfoWithContext(context)
and process.TimesWithContext(context)
from the library https://github.com/shirou/gopsutil, and I checked their codebase a little as well: they use the standard Linux approach of reading /proc/{pid}/statm
and /proc/{pid}/stat
. Again, reading these files doesn't require root access at all. Since no error was logged here at all, I guess Nomad silently returns zero somewhere earlier. Further investigation would require me to debug the source code, so I raised this ticket for help; I hope you can help address this issue.
Let me know if you need any more information.
In case you need full logs from the Nomad client, including the logs from running nomad job run python-service.hcl
and nomad alloc status <allocId>
, below are the logs right after the startup logs.
2024-04-05T11:17:17.222+0100 [DEBUG] nomad/worker.go:524: worker: dequeued evaluation: worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213 eval_id=09539011-9833-570d-72c3-2dcd9a08630b type=service namespace=default job_id=python-service node_id="" triggered_by=job-register
2024-04-05T11:17:17.222+0100 [DEBUG] agent/http.go:714: http: request complete: method=PUT path=/v1/jobs duration=2.057555ms
2024-04-05T11:17:17.222+0100 [DEBUG] scheduler/generic_sched.go:390: worker.service_sched: reconciled current state with desired state: eval_id=09539011-9833-570d-72c3-2dcd9a08630b job_id=python-service namespace=default worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213
results=
| Total changes: (place 1) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
| Created Deployment: "e9de8c7a-23f6-e68f-84d7-fe07778ccb4e"
| Desired Changes for "python-service-nomad": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)
2024-04-05T11:17:17.225+0100 [DEBUG] nomad/worker.go:689: worker: submitted plan for evaluation: worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213 eval_id=09539011-9833-570d-72c3-2dcd9a08630b
2024-04-05T11:17:17.225+0100 [DEBUG] scheduler/util.go:629: worker.service_sched: setting eval status: eval_id=09539011-9833-570d-72c3-2dcd9a08630b job_id=python-service namespace=default worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213 status=complete
2024-04-05T11:17:17.225+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/evaluation/09539011-9833-570d-72c3-2dcd9a08630b duration="168.904µs"
2024-04-05T11:17:17.225+0100 [DEBUG] nomad/worker.go:752: worker: updated evaluation: worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213 eval="<Eval \"09539011-9833-570d-72c3-2dcd9a08630b\" JobID: \"python-service\" Namespace: \"default\">"
2024-04-05T11:17:17.225+0100 [DEBUG] nomad/worker.go:559: worker: ack evaluation: worker_id=8c4e7ab8-ffdf-e23d-6428-f7940a1fa213 eval_id=09539011-9833-570d-72c3-2dcd9a08630b type=service namespace=default job_id=python-service node_id="" triggered_by=job-register
2024-04-05T11:17:17.226+0100 [DEBUG] client/client.go:2462: client: updated allocations: index=12 total=1 pulled=1 filtered=0
2024-04-05T11:17:17.226+0100 [DEBUG] client/client.go:2535: client: allocation updates: added=1 removed=0 updated=0 ignored=0
2024-04-05T11:17:17.226+0100 [INFO] taskrunner/task_runner.go:1402: client.alloc_runner.task_runner: Task event: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad type=Received msg="Task received by client" failed=false
2024-04-05T11:17:17.226+0100 [DEBUG] client/client.go:2579: client: allocation updates applied: added=1 removed=0 updated=0 ignored=0 errors=0
2024-04-05T11:17:17.226+0100 [DEBUG] widmgr/widmgr.go:115: client.alloc_runner.widmgr: retrieving and renewing workload identities: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed num_identities=0
2024-04-05T11:17:17.227+0100 [DEBUG] taskrunner/task_runner.go:591: client.alloc_runner.task_runner: lifecycle start condition has been met, proceeding: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad
2024-04-05T11:17:17.227+0100 [INFO] taskrunner/task_runner.go:1402: client.alloc_runner.task_runner: Task event: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad type="Task Setup" msg="Building Task Directory" failed=false
2024-04-05T11:17:17.227+0100 [DEBUG] cmdrunner/cmd_runner.go:73: client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad path=/dat/cathedral_core/developers/nomad/nomad1.7.6 args=["/dat/cathedral_core/developers/nomad/nomad1.7.6", "logmon"]
2024-04-05T11:17:17.228+0100 [DEBUG] cmdrunner/cmd_runner.go:80: client.alloc_runner.task_runner.task_hook.logmon: plugin started: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad path=/dat/cathedral_core/developers/nomad/nomad1.7.6 pid=17300
2024-04-05T11:17:17.228+0100 [DEBUG] [email protected]/client.go:825: client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad plugin=/dat/cathedral_core/developers/nomad/nomad1.7.6
2024-04-05T11:17:17.228+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/evaluation/09539011-9833-570d-72c3-2dcd9a08630b/allocations duration="264.661µs"
2024-04-05T11:17:17.258+0100 [DEBUG] [email protected]/client.go:1224: client.alloc_runner.task_runner.task_hook.logmon.nomad1.7.6: plugin address: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad @module=logmon address=/tmp/plugin1256268056 network=unix timestamp="2024-04-05T11:17:17.258+0100"
2024-04-05T11:17:17.258+0100 [DEBUG] [email protected]/client.go:878: client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad version=2
2024-04-05T11:17:17.259+0100 [DEBUG] [email protected]/client.go:1224: client.alloc_runner.task_runner.task_hook.logmon.nomad1.7.6: opening fifo: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad path=/tmp/NomadClient2485119766/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed/alloc/logs/.web-app-nomad.stdout.fifo @module=logmon timestamp="2024-04-05T11:17:17.259+0100"
2024-04-05T11:17:17.259+0100 [DEBUG] [email protected]/client.go:1224: client.alloc_runner.task_runner.task_hook.logmon.nomad1.7.6: opening fifo: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad @module=logmon path=/tmp/NomadClient2485119766/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed/alloc/logs/.web-app-nomad.stderr.fifo timestamp="2024-04-05T11:17:17.259+0100"
2024-04-05T11:17:17.262+0100 [INFO] rawexec/driver.go:322: client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:python Args:[-m http.server 20186]}"
2024-04-05T11:17:17.262+0100 [DEBUG] cmdrunner/cmd_runner.go:73: client.driver_mgr.raw_exec.executor: starting plugin: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed driver=raw_exec task_name=web-app-nomad path=/dat/cathedral_core/developers/nomad/nomad1.7.6 args=["/dat/cathedral_core/developers/nomad/nomad1.7.6", "executor", "{\"LogFile\":\"/tmp/NomadClient2485119766/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed/web-app-nomad/executor.out\",\"LogLevel\":\"debug\",\"FSIsolation\":false,\"Compute\":{\"tc\":23936,\"nc\":8}}"]
2024-04-05T11:17:17.263+0100 [DEBUG] cmdrunner/cmd_runner.go:80: client.driver_mgr.raw_exec.executor: plugin started: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed driver=raw_exec task_name=web-app-nomad path=/dat/cathedral_core/developers/nomad/nomad1.7.6 pid=17312
2024-04-05T11:17:17.263+0100 [DEBUG] [email protected]/client.go:825: client.driver_mgr.raw_exec.executor: waiting for RPC address: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed driver=raw_exec task_name=web-app-nomad plugin=/dat/cathedral_core/developers/nomad/nomad1.7.6
2024-04-05T11:17:17.293+0100 [DEBUG] [email protected]/client.go:878: client.driver_mgr.raw_exec.executor: using plugin: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed driver=raw_exec task_name=web-app-nomad version=2
2024-04-05T11:17:17.300+0100 [INFO] taskrunner/task_runner.go:1402: client.alloc_runner.task_runner: Task event: alloc_id=50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed task=web-app-nomad type=Started msg="Task started by client" failed=false
2024-04-05T11:17:17.950+0100 [DEBUG] client/client.go:2462: client: updated allocations: index=14 total=1 pulled=0 filtered=1
2024-04-05T11:17:17.951+0100 [DEBUG] client/client.go:2535: client: allocation updates: added=0 removed=0 updated=0 ignored=1
2024-04-05T11:17:17.951+0100 [DEBUG] client/client.go:2579: client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
2024-04-05T11:17:28.039+0100 [DEBUG] client/client.go:2462: client: updated allocations: index=15 total=1 pulled=0 filtered=1
2024-04-05T11:17:28.039+0100 [DEBUG] client/client.go:2535: client: allocation updates: added=0 removed=0 updated=0 ignored=1
2024-04-05T11:17:28.039+0100 [DEBUG] client/client.go:2579: client: allocation updates applied: added=0 removed=0 updated=0 ignored=1 errors=0
2024-04-05T11:17:29.290+0100 [DEBUG] nomad/worker.go:524: worker: dequeued evaluation: worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f eval_id=8d8ce2ef-23a5-414a-a93a-ffd2b4a72545 type=service namespace=default job_id=python-service node_id="" triggered_by=deployment-watcher
2024-04-05T11:17:29.291+0100 [DEBUG] scheduler/generic_sched.go:390: worker.service_sched: reconciled current state with desired state: eval_id=8d8ce2ef-23a5-414a-a93a-ffd2b4a72545 job_id=python-service namespace=default worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f
results=
| Total changes: (place 0) (destructive 0) (inplace 0) (stop 0) (disconnect 0) (reconnect 0)
| Deployment Update for ID "e9de8c7a-23f6-e68f-84d7-fe07778ccb4e": Status "successful"; Description "Deployment completed successfully"
| Desired Changes for "python-service-nomad": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)
2024-04-05T11:17:29.291+0100 [DEBUG] nomad/worker.go:689: worker: submitted plan for evaluation: worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f eval_id=8d8ce2ef-23a5-414a-a93a-ffd2b4a72545
2024-04-05T11:17:29.291+0100 [DEBUG] scheduler/util.go:629: worker.service_sched: setting eval status: eval_id=8d8ce2ef-23a5-414a-a93a-ffd2b4a72545 job_id=python-service namespace=default worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f status=complete
2024-04-05T11:17:29.292+0100 [DEBUG] nomad/worker.go:752: worker: updated evaluation: worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f eval="<Eval \"8d8ce2ef-23a5-414a-a93a-ffd2b4a72545\" JobID: \"python-service\" Namespace: \"default\">"
2024-04-05T11:17:29.292+0100 [DEBUG] nomad/worker.go:559: worker: ack evaluation: worker_id=6d9672fe-72c4-16b0-c8db-058cf950906f eval_id=8d8ce2ef-23a5-414a-a93a-ffd2b4a72545 type=service namespace=default job_id=python-service node_id="" triggered_by=deployment-watcher
2024-04-05T11:18:10.998+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/allocations?prefix=50a8fb0d- duration="339.696µs"
2024-04-05T11:18:11.002+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/allocation/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed?namespace=default duration="244.028µs"
2024-04-05T11:18:11.007+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/client/allocation/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed/checks duration="114.396µs"
2024-04-05T11:18:11.009+0100 [DEBUG] agent/http.go:714: http: request complete: method=GET path=/v1/client/allocation/50a8fb0d-ea62-4062-d45a-7ba0aa0c3aed/stats duration="155.035µs"
Now I am more confident this is a bug, since it works fine on Windows and macOS Nomad clients, but not on Linux (or at least CentOS 7.6). Can you please advise with some code references on the entry points to the client's task/allocation resource usage, so that I can download the code and debug further what's going wrong on the Linux platform (or CentOS 7.6)?
@jrasell hey James, I found the root cause after debugging the code base. Basically, drivers/shared/executor/procstats/getstats.go@scanPIDs() is supposed to fetch the detected live PIDs that make up the task's process tree (i.e. the task's group). However, in my case it updates with an empty PID set, because executor_universal_linux.go@ListProcesses() assumes that cgroups are available on the client machine, which ends up with zero live PIDs when cgroups are not actually in use.
func (e *UniversalExecutor) ListProcesses() *set.Set[procstats.ProcessID] {
	return procstats.List(e.command)
}
One proposal for fixing this is to add a condition here that falls back to alternative logic when cgroups are missing, as below.
func (e *UniversalExecutor) ListProcesses() *set.Set[procstats.ProcessID] {
	if cgroupslib.GetMode() != cgroupslib.OFF {
		return procstats.List(e.command)
	}
	return procstats.ListByPid(e.childCmd.Process.Pid)
}
The procstats.ListByPid
mentioned in the proposal is exactly the same as list_default.go's List(executorPID int), so I suggest we refactor it into a common package and rename it to ListByPid, e.g. procstats.go@ListByPid(executorPID int).