Problems when running tasks using podman driver (Inconsistent UI and allocations are stuck)
Nomad version
Nomad v1.4.1 (2aa7e66bdb526e25f59883952d74dad7ea9a014e)
agent: detected plugin: name=podman type=driver plugin_version=0.4.0
Operating system and Environment details
Linux nomad-worker-1 5.19.7-1-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.7-1 (Tue, 06 Sep 2022 07:54:58 + x86_64 x86_64 x86_64 GNU/Linux
Issue
Inconsistent reporting and wrong behaviour of tasks using podman driver.
It seems Nomad is unable to properly control tasks using the podman driver.
When I try to restart a task/allocation, I can see the container being stopped on the host, but it is never restarted.
In the Nomad web UI the tasks appear stuck in various states: pending, queued, or running. Whenever I try to stop or restart an allocation, the UI confirms my action, but when I reload the page the state is back to "running", even though no container is running on the machine. I have count=1
in my job, but since Nomad fails to figure out what is going on, it creates additional allocations. This leads to a deadlock: resources are exhausted and there is no proper way to recover from there. The allocation is stuck; it cannot be stopped, removed, or anything else.
I can stop the container on the machine and restart the Nomad service, and eventually it figures out that the job is dead. Then I can start the job exactly once, and if I want to restart it, the same loop starts again.
There are no errors visible in the logs.
See the attached screenshot: one allocation is starting, one is queued, and one is running. But in reality not a single container is running on the machine.
Reproduction steps
Run a job/task using the podman driver and try to restart or stop it.
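For completeness, the reproduction step above can be sketched as a minimal command sequence. This is a sketch, not taken from the original report: the job name repro and the nginx image are placeholders, and it assumes a running Nomad client with the podman driver installed.

```shell
# Hypothetical minimal job using the podman driver.
cat > repro.nomad <<'EOF'
job "repro" {
  datacenters = ["dc1"]
  group "app" {
    count = 1
    task "app" {
      driver = "podman"
      config {
        image = "docker://docker.io/library/nginx:latest"
      }
    }
  }
}
EOF

nomad job run repro.nomad
# Find the allocation ID, then try to restart it:
nomad job status repro
nomad alloc restart <alloc-id>
# Observed behaviour: the container is stopped on the host but never
# started again, while the UI keeps reporting the allocation as running.
```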
Expected Result
The container should restart
Actual Result
The container stops and the allocation gets stuck in some state.
Job file (if appropriate)
{
  "Stop": false,
  "Region": "global",
  "Namespace": "default",
  "ID": "remote-traefik",
  "ParentID": "",
  "Name": "remote-traefik",
  "Type": "service",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "remote"
  ],
  "Constraints": null,
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "traefik",
      "Count": 1,
      "Update": {
        "Stagger": 30000000000,
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000,
        "ProgressDeadline": 600000000000,
        "AutoRevert": false,
        "AutoPromote": false,
        "Canary": 0
      },
      "Migrate": {
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000
      },
      "Constraints": [
        {
          "LTarget": "${attr.consul.version}",
          "RTarget": ">= 1.7.0",
          "Operand": "semver"
        }
      ],
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 2,
        "Interval": 1800000000000,
        "Delay": 15000000000,
        "Mode": "fail"
      },
      "Tasks": [
        {
          "Name": "traefik",
          "Driver": "podman",
          "User": "",
          "Config": {
            "volumes": [
              "/opt/traefik/conf:/etc/traefik",
              "/opt/traefik/acme:/etc/traefik/acme"
            ],
            "image": "docker://docker.io/traefik:v2.9",
            "ports": [
              "http",
              "https",
              "traefik"
            ],
            "force_pull": true
          },
          "Env": {},
          "Services": null,
          "Vault": null,
          "Templates": null,
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 500,
            "Cores": 0,
            "MemoryMB": 1024,
            "MemoryMaxMB": 0,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null
          },
          "RestartPolicy": {
            "Attempts": 2,
            "Interval": 1800000000000,
            "Delay": 15000000000,
            "Mode": "fail"
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 10,
            "MaxFileSizeMB": 10
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 300,
        "Migrate": false
      },
      "Meta": null,
      "ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 0,
        "Delay": 30000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 3600000000000,
        "Unlimited": true
      },
      "Affinities": null,
      "Spreads": null,
      "Networks": [
        {
          "Mode": "",
          "Device": "",
          "CIDR": "",
          "IP": "",
          "Hostname": "",
          "MBits": 0,
          "DNS": null,
          "ReservedPorts": [
            {
              "Label": "http",
              "Value": 8080,
              "To": 0,
              "HostNetwork": "default"
            },
            {
              "Label": "https",
              "Value": 4434,
              "To": 0,
              "HostNetwork": "default"
            },
            {
              "Label": "traefik",
              "Value": 8081,
              "To": 0,
              "HostNetwork": "default"
            }
          ],
          "DynamicPorts": null
        }
      ],
      "Consul": {
        "Namespace": ""
      },
      "Services": [
        {
          "Name": "remote-traefik-traefik",
          "TaskName": "",
          "PortLabel": "traefik",
          "AddressMode": "auto",
          "Address": "",
          "EnableTagOverride": false,
          "Tags": null,
          "CanaryTags": null,
          "Checks": null,
          "Connect": null,
          "Meta": null,
          "CanaryMeta": null,
          "TaggedAddresses": null,
          "Namespace": "default",
          "OnUpdate": "require_healthy",
          "Provider": "consul"
        }
      ],
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "MaxClientDisconnect": null
    }
  ],
  "Update": {
    "Stagger": 30000000000,
    "MaxParallel": 1,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "DispatchIdempotencyToken": "",
  "Payload": null,
  "Meta": null,
  "ConsulToken": "",
  "ConsulNamespace": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "pending",
  "StatusDescription": "",
  "Stable": true,
  "Version": 17,
  "SubmitTime": 1665528847881678800,
  "CreateIndex": 16221,
  "ModifyIndex": 17316,
  "JobModifyIndex": 16662
}
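A note for readers scanning the job file above: Nomad serializes durations as nanoseconds in the JSON API representation. A small sketch (field paths taken from the JSON above) converting them to human-readable values:

```python
# Convert Nomad's nanosecond duration fields to human-readable strings.
def ns_to_human(ns: int) -> str:
    seconds = ns / 1_000_000_000
    if seconds >= 3600:
        return f"{seconds / 3600:g}h"
    if seconds >= 60:
        return f"{seconds / 60:g}m"
    return f"{seconds:g}s"

# Values copied from the job JSON above.
durations = {
    "Update.Stagger": 30000000000,            # 30s
    "RestartPolicy.Interval": 1800000000000,  # 30m
    "RestartPolicy.Delay": 15000000000,       # 15s
    "Update.HealthyDeadline": 300000000000,   # 5m
    "Update.ProgressDeadline": 600000000000,  # 10m
    "ReschedulePolicy.MaxDelay": 3600000000000,  # 1h
}
for name, ns in durations.items():
    print(f"{name}: {ns_to_human(ns)}")
```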
I know the description is a bit vague; happy to provide more details.
As you can see in the screenshot, it says "1 Starting". Clicking on it reveals: nothing.
Clicking on the "Running" allocation I get this:
Allocation is running, last event says "Sent interrupt..." and on the node:
podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
podman ps -a
...
9169f4c384e1 docker.io/library/traefik:v2.9 traefik 9 hours ago Exited (0) 18 minutes ago 10.23.1.1:4434->4434/tcp, 10.23.1.1:8080-8081->8080-8081/tcp, 10.23.1.1:4434->4434/udp, 10.23.1.1:8080-8081->8080-8081/udp traefik-e9c2e366-8718-ffeb-651f-3aaa8bf11233
The container is even there, stopped. But Nomad seems unable to do a podman restart. I can do that manually and the container comes up, but this does not make Nomad recognise it.
Another example: UI reports task as running. No container is running.
Trying to stop it via cli:
nomad alloc stop 66743a70
==> 2022-10-12T09:40:10+02:00: Monitoring evaluation "4f2b5634"
2022-10-12T09:40:10+02:00: Evaluation triggered by job "remote-matrix"
2022-10-12T09:40:11+02:00: Evaluation status changed: "pending" -> "complete"
==> 2022-10-12T09:40:11+02:00: Evaluation "4f2b5634" finished with status "complete" but failed to place all allocations:
2022-10-12T09:40:11+02:00: Task Group "matrix" (failed to place 1 allocation):
* Resources exhausted on 1 nodes
* Dimension "network: reserved port collision http=8008" exhausted on 1 nodes
2022-10-12T09:40:11+02:00: Evaluation "118c97e0" waiting for additional capacity to place remainder
root@nomad-worker-1:~# nomad alloc stop 66743a70
==> 2022-10-12T09:40:27+02:00: Monitoring evaluation "6fbbd32f"
2022-10-12T09:40:27+02:00: Evaluation triggered by job "remote-matrix"
2022-10-12T09:40:28+02:00: Evaluation status changed: "pending" -> "complete"
==> 2022-10-12T09:40:28+02:00: Evaluation "6fbbd32f" finished with status "complete" but failed to place all allocations:
2022-10-12T09:40:28+02:00: Task Group "matrix" (failed to place 1 allocation):
* Resources exhausted on 1 nodes
* Dimension "network: reserved port collision http=8008" exhausted on 1 nodes
2022-10-12T09:40:28+02:00: Evaluation "122bd553" waiting for additional capacity to place remainder
I ran the same command twice, and every time it says Evaluation status changed: "pending" -> "complete",
but nothing changes.
I also don't understand why the evaluation to stop the task complains about resource exhaustion. If the allocation were properly stopped, the following allocation could succeed, because the port would be free. But for me everything is just stuck, without any error message (that I could find so far).
nomad job stop remote-traefik
==> 2022-10-12T09:47:43+02:00: Monitoring evaluation "035d952c"
2022-10-12T09:47:44+02:00: Evaluation triggered by job "remote-traefik"
2022-10-12T09:47:45+02:00: Evaluation status changed: "pending" -> "complete"
==> 2022-10-12T09:47:45+02:00: Evaluation "035d952c" finished with status "complete"
root@nomad-worker-1:~# nomad job status
ID Type Priority Status Submit Date
remote-matrix service 50 dead (stopped) 2022-10-12T01:15:04+02:00
remote-traefik service 50 dead (stopped) 2022-10-12T00:54:07+02:00
OK, I stopped the job and its status is dead.
Looking at the UI, I still see one running and one starting allocation, which again seems to lead to resource exhaustion because the stuck allocation is holding the port.
How can the reported state be so different, I wonder?
I'm seeing the same. I think in my case the container is crashing on startup (due to having a dependency on a consul service registration from another nomad job which hasn't gone healthy by the time this job starts up. Normally, a retry would happen after the dependent task is running, and everything would be good). Nomad doesn't notice the container has crashed and the alloc remains in the running state.
Any attempt to recover by stopping/restarting the alloc receives errors from the podman api, because the container nomad is trying to stop is not running. Nomad doesn't transition the alloc from the running state into any other state (e.g. lost, failed), and keeps it in the running state.
Attempting a nomad job stop -purge and re-running the job also fails, due to a port collision, even though
a) there are no longer any running jobs from nomad's perspective, because the previous job was purged and
b) there is no container using those ports
(nomad 1.4.1, podman 4.1.1)
EDIT: Downgrading nomad to 1.3.5 and podman from 4.1.1 to 4.0.2 immediately resolved the problems. I tried upgrading nomad back to 1.4.1 with podman still on 4.0.2, but still had issues (possibly different ones, podman commands were hanging while nomad was trying to start containers).
Hi @soupdiver and @optiz0r and thanks for raising this issue with all the context.
Same as https://github.com/hashicorp/nomad-driver-podman/issues/182? Note that the fix is only in the nightly version of nomad-driver-podman
ah nice thanks! Will try that. Any reason this does not trigger a new release of the driver?
Yup, the latest nightly has resolved things for me too, thanks.