
Problems when running tasks using podman driver (Inconsistent UI and allocations are stuck)

Open · soupdiver opened this issue 1 year ago · 7 comments

Nomad version

Nomad v1.4.1 (2aa7e66bdb526e25f59883952d74dad7ea9a014e)
agent: detected plugin: name=podman type=driver plugin_version=0.4.0

Operating system and Environment details

Linux nomad-worker-1 5.19.7-1-pve #1 SMP PREEMPT_DYNAMIC PVE 5.19.7-1 (Tue, 06 Sep 2022 07:54:58 + x86_64 x86_64 x86_64 GNU/Linux

Issue

Inconsistent reporting and wrong behaviour of tasks using the podman driver. Nomad seems unable to properly control tasks using the podman driver. When I try to restart a task/allocation I can see that the container is stopped on the host, but it is never restarted. In the Nomad web UI I can see the tasks stuck in various states: pending, queued or running. Whenever I try to stop/restart the allocation the UI confirms my action, but when I reload the page the state is back to "running" even though no container is running on the machine.

I have count = 1 in my job, but since Nomad fails to figure out what is going on it creates more allocations, which leads to a deadlock because resources are exhausted and there is no proper way to recover from there. The allocation is stuck; it can't be stopped or removed. I can stop the container on the machine and restart the Nomad service, and eventually Nomad figures out that the job is dead. Then I can start it exactly once, and as soon as I want to restart it the same loop begins again. There are no errors visible in the logs.

See the attached screenshot: one allocation is starting, one is queued and one is running, but not a single container is actually running on the machine.

Reproduction steps

Run a job/task using podman driver and try to restart/stop it.
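
A minimal sketch of the commands involved (the job file name and the allocation ID are placeholders):

# Submit a job that uses the podman task driver
nomad job run remote-traefik.nomad

# Find the allocation that was placed for the job
nomad job status remote-traefik

# Try to restart or stop the allocation
nomad alloc restart <alloc-id>
nomad alloc stop <alloc-id>

# Compare with what podman actually reports on the client node
podman ps -a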

Expected Result

The container should restart

Actual Result

The container stops and the allocation gets stuck in some state.

Job file (if appropriate)

{
  "Stop": false,
  "Region": "global",
  "Namespace": "default",
  "ID": "remote-traefik",
  "ParentID": "",
  "Name": "remote-traefik",
  "Type": "service",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "remote"
  ],
  "Constraints": null,
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "traefik",
      "Count": 1,
      "Update": {
        "Stagger": 30000000000,
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000,
        "ProgressDeadline": 600000000000,
        "AutoRevert": false,
        "AutoPromote": false,
        "Canary": 0
      },
      "Migrate": {
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000
      },
      "Constraints": [
        {
          "LTarget": "${attr.consul.version}",
          "RTarget": ">= 1.7.0",
          "Operand": "semver"
        }
      ],
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 2,
        "Interval": 1800000000000,
        "Delay": 15000000000,
        "Mode": "fail"
      },
      "Tasks": [
        {
          "Name": "traefik",
          "Driver": "podman",
          "User": "",
          "Config": {
            "volumes": [
              "/opt/traefik/conf:/etc/traefik",
              "/opt/traefik/acme:/etc/traefik/acme"
            ],
            "image": "docker://docker.io/traefik:v2.9",
            "ports": [
              "http",
              "https",
              "traefik"
            ],
            "force_pull": true
          },
          "Env": {
          },
          "Services": null,
          "Vault": null,
          "Templates": null,
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 500,
            "Cores": 0,
            "MemoryMB": 1024,
            "MemoryMaxMB": 0,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null
          },
          "RestartPolicy": {
            "Attempts": 2,
            "Interval": 1800000000000,
            "Delay": 15000000000,
            "Mode": "fail"
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 10,
            "MaxFileSizeMB": 10
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 300,
        "Migrate": false
      },
      "Meta": null,
      "ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 0,
        "Delay": 30000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 3600000000000,
        "Unlimited": true
      },
      "Affinities": null,
      "Spreads": null,
      "Networks": [
        {
          "Mode": "",
          "Device": "",
          "CIDR": "",
          "IP": "",
          "Hostname": "",
          "MBits": 0,
          "DNS": null,
          "ReservedPorts": [
            {
              "Label": "http",
              "Value": 8080,
              "To": 0,
              "HostNetwork": "default"
            },
            {
              "Label": "https",
              "Value": 4434,
              "To": 0,
              "HostNetwork": "default"
            },
            {
              "Label": "traefik",
              "Value": 8081,
              "To": 0,
              "HostNetwork": "default"
            }
          ],
          "DynamicPorts": null
        }
      ],
      "Consul": {
        "Namespace": ""
      },
      "Services": [
        {
          "Name": "remote-traefik-traefik",
          "TaskName": "",
          "PortLabel": "traefik",
          "AddressMode": "auto",
          "Address": "",
          "EnableTagOverride": false,
          "Tags": null,
          "CanaryTags": null,
          "Checks": null,
          "Connect": null,
          "Meta": null,
          "CanaryMeta": null,
          "TaggedAddresses": null,
          "Namespace": "default",
          "OnUpdate": "require_healthy",
          "Provider": "consul"
        }
      ],
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "MaxClientDisconnect": null
    }
  ],
  "Update": {
    "Stagger": 30000000000,
    "MaxParallel": 1,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "DispatchIdempotencyToken": "",
  "Payload": null,
  "Meta": null,
  "ConsulToken": "",
  "ConsulNamespace": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "pending",
  "StatusDescription": "",
  "Stable": true,
  "Version": 17,
  "SubmitTime": 1665528847881678800,
  "CreateIndex": 16221,
  "ModifyIndex": 17316,
  "JobModifyIndex": 16662
}
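
For reference, the JSON above is the job as the Nomad servers store it; the server-side view can be re-checked at any time, for example with:

# Dump the job definition as JSON
nomad job inspect remote-traefik

# Show the allocations and their reported states
nomad job status remote-traefik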

I know the description is a bit vague. Happy to provide more details.

[screenshot]

As you can see in the screenshot, it says "1 Starting". Clicking on it reveals: nothing

[screenshot]

Clicking on the "Running" allocation I get this:

[screenshot]

The allocation is shown as running, the last event says "Sent interrupt...", and on the node:

podman ps
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES
podman ps -a
...
9169f4c384e1  docker.io/library/traefik:v2.9             traefik               9 hours ago   Exited (0) 18 minutes ago            10.23.1.1:4434->4434/tcp, 10.23.1.1:8080-8081->8080-8081/tcp, 10.23.1.1:4434->4434/udp, 10.23.1.1:8080-8081->8080-8081/udp  traefik-e9c2e366-8718-ffeb-651f-3aaa8bf11233

The container is even there, just stopped, but Nomad seems completely unable to restart it via podman. I can restart it manually and the container comes up, but that does not make Nomad recognise it.
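
The manual workaround described above is roughly this (a sketch; the container name is the one from the podman ps -a output, and the systemd unit name for the Nomad client may differ):

# Stop the leftover container that Nomad has lost track of (if it is still running)
podman stop traefik-e9c2e366-8718-ffeb-651f-3aaa8bf11233

# Restart the Nomad client so it eventually marks the job as dead
systemctl restart nomad

# After that, the job can be started exactly once before the loop repeats
nomad job status remote-traefik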

soupdiver avatar Oct 12 '22 07:10 soupdiver

Another example: UI reports the task as running. No container is running.

[screenshot]

Trying to stop it via the CLI:

nomad alloc stop 66743a70
==> 2022-10-12T09:40:10+02:00: Monitoring evaluation "4f2b5634"
    2022-10-12T09:40:10+02:00: Evaluation triggered by job "remote-matrix"
    2022-10-12T09:40:11+02:00: Evaluation status changed: "pending" -> "complete"
==> 2022-10-12T09:40:11+02:00: Evaluation "4f2b5634" finished with status "complete" but failed to place all allocations:
    2022-10-12T09:40:11+02:00: Task Group "matrix" (failed to place 1 allocation):
      * Resources exhausted on 1 nodes
      * Dimension "network: reserved port collision http=8008" exhausted on 1 nodes
    2022-10-12T09:40:11+02:00: Evaluation "118c97e0" waiting for additional capacity to place remainder
root@nomad-worker-1:~# nomad alloc stop 66743a70
==> 2022-10-12T09:40:27+02:00: Monitoring evaluation "6fbbd32f"
    2022-10-12T09:40:27+02:00: Evaluation triggered by job "remote-matrix"
    2022-10-12T09:40:28+02:00: Evaluation status changed: "pending" -> "complete"
==> 2022-10-12T09:40:28+02:00: Evaluation "6fbbd32f" finished with status "complete" but failed to place all allocations:
    2022-10-12T09:40:28+02:00: Task Group "matrix" (failed to place 1 allocation):
      * Resources exhausted on 1 nodes
      * Dimension "network: reserved port collision http=8008" exhausted on 1 nodes
    2022-10-12T09:40:28+02:00: Evaluation "122bd553" waiting for additional capacity to place remainder

I ran the same command twice; every time it says Evaluation status changed: "pending" -> "complete", but nothing changes.

I also don't understand why the evaluation to stop the allocation complains about resource exhaustion. If the allocation were properly stopped, the following allocation could succeed because the port would be free. But for me everything is just stuck, without any error message that I have been able to find so far.
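
One way to double-check whether the port is really occupied on the node (a sketch; the port number is taken from the collision message above, and ss assumes iproute2 is installed):

# Is anything actually listening on the "reserved" port?
ss -ltnp | grep 8008

# Is any container, running or exited, still mapped to it?
podman ps -a --format '{{.Names}} {{.Ports}} {{.Status}}' | grep 8008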

soupdiver avatar Oct 12 '22 07:10 soupdiver

nomad job stop remote-traefik
==> 2022-10-12T09:47:43+02:00: Monitoring evaluation "035d952c"
    2022-10-12T09:47:44+02:00: Evaluation triggered by job "remote-traefik"
    2022-10-12T09:47:45+02:00: Evaluation status changed: "pending" -> "complete"
==> 2022-10-12T09:47:45+02:00: Evaluation "035d952c" finished with status "complete"
root@nomad-worker-1:~# nomad job status
ID              Type     Priority  Status          Submit Date
remote-matrix   service  50        dead (stopped)  2022-10-12T01:15:04+02:00
remote-traefik  service  50        dead (stopped)  2022-10-12T00:54:07+02:00

OK, I stopped the job and its status is dead. Looking in the UI, I still see 1 running and 1 starting allocation, which again seems to lead to resource exhaustion because the allocation is still holding the port.

How can the reporting of state be so different, I wonder?

[screenshot]
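
A way to put the two views side by side (a sketch; the allocation ID is a placeholder):

# Server-side view: job and allocation states as Nomad reports them
nomad job status remote-traefik
nomad alloc status -verbose <alloc-id>

# Client-side view: what podman actually has on the node
podman ps -a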

soupdiver avatar Oct 12 '22 07:10 soupdiver

I'm seeing the same. I think in my case the container is crashing on startup (it depends on a Consul service registration from another Nomad job which hasn't gone healthy by the time this job starts up; normally a retry would happen once the dependent task is running, and everything would be fine). Nomad doesn't notice the container has crashed and the alloc remains in the running state.

Any attempt to recover by stopping/restarting the alloc gets errors back from the podman API, because the container Nomad is trying to stop is not running. Nomad never transitions the alloc out of the running state into any other state (e.g. lost, failed).

Attempting a nomad job stop -purge and re-running the job also fails, due to a port collision, even though a) there are no longer any running jobs from Nomad's perspective, because the previous job was purged, and b) there is no container using those ports.
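
Roughly the recovery attempt described above (a sketch; the job name and job file are placeholders):

# Stop the job and purge it from Nomad's state entirely
nomad job stop -purge <job-name>

# Nothing should be using the ports at this point
podman ps -a

# Re-submitting the job still fails with a reserved port collision
nomad job run <job-file>.nomad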

(nomad 1.4.1, podman 4.1.1)

EDIT: Downgrading nomad to 1.3.5 and podman from 4.1.1 to 4.0.2 immediately resolved the problems. I tried upgrading nomad back to 1.4.1 with podman still on 4.0.2, but still had issues (possibly different ones, podman commands were hanging while nomad was trying to start containers).

optiz0r avatar Oct 15 '22 22:10 optiz0r

Hi @soupdiver and @optiz0r and thanks for raising this issue with all the context.

jrasell avatar Oct 17 '22 08:10 jrasell

Same as https://github.com/hashicorp/nomad-driver-podman/issues/182? Note that the fix is only in the nightly version of nomad-driver-podman

ntkoopman avatar Oct 29 '22 04:10 ntkoopman

> Same as hashicorp/nomad-driver-podman#182? Note that the fix is only in the nightly version of nomad-driver-podman

Ah nice, thanks! Will try that. Any reason this does not trigger a new release of the driver?
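
For anyone else trying the nightly build, installing it looks roughly like this (a sketch; the download step is left out, and the plugin directory is an assumption that depends on the client's plugin_dir setting):

# Place the nightly nomad-driver-podman binary into the client's plugin directory
install -m 0755 nomad-driver-podman /opt/nomad/plugins/

# Restart the Nomad client so it picks up the new driver binary
systemctl restart nomad

# Confirm which plugin version the agent detected
nomad node status -self -verbose | grep -i podman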

soupdiver avatar Oct 29 '22 06:10 soupdiver

Yup, the latest nightly has resolved things for me too, thanks.

optiz0r avatar Oct 29 '22 15:10 optiz0r