Performance and scaling of nomad-driver-podman under high allocation loads
I have been using the Podman driver for my workloads, and my current project involves launching 16,000 containers across a very large Nomad cluster. I have run into some performance and scaling issues that have impacted deploying these workloads, and I am wondering if there are any specific steps I could take to improve the stability of my deployments and optimize the number of containers per client node. Here are the two major issues I am seeing when launching jobs in batches of 4,000:
- The deployment piles allocations onto a small number of client nodes: some nodes end up with one or two containers running while others have 50+. This seems to cause the second issue.
- The Podman socket, I assume, gets overloaded. Under high allocation load the Podman driver shows as unavailable in the Web UI and allocations start failing.
The failed allocations tend to snowball a client node into an unusable state because the Podman socket never fully recovers to accept new allocations, which leads to a large number of failed allocations.
Does anyone have any recommendations for changing my jobs so they spread out more evenly across my client nodes? I think I need more time between container starts. I am using these settings in my job:
update {
  stagger           = "30s"
  max_parallel      = 1
  min_healthy_time  = "15s"
  progress_deadline = "30m"
}

restart {
  attempts = 10
  interval = "30m"
  delay    = "2m"
  mode     = "fail"
}

scaling {
  enabled = true
  min     = 0
  max     = 20000
}
Also, any thoughts on why the Podman socket gets overwhelmed by the driver? My client nodes run Fedora CoreOS, which has pretty decent sysctl settings out of the box, and I am also applying the Nomad-recommended settings:
- path: /etc/sysctl.d/30-nomad-bridge-iptables.conf
  contents:
    inline: |
      net.bridge.bridge-nf-call-arptables=1
      net.bridge.bridge-nf-call-ip6tables=1
      net.bridge.bridge-nf-call-iptables=1
- path: /etc/sysctl.d/31-nomad-dynamic-ports.conf
  contents:
    inline: |
      net.ipv4.ip_local_port_range=49152 65535
- path: /etc/sysctl.d/32-nomad-max-user.conf
  contents:
    inline: |
      fs.inotify.max_user_instances=16384
      fs.inotify.max_user_watches=1048576
- path: /etc/sysctl.d/33-nomad-nf-conntrack-max.conf
  contents:
    inline: |
      net.netfilter.nf_conntrack_max = 524288
$ cat /proc/sys/fs/file-max
9223372036854775807
Does anyone else use the Podman driver for high allocation workloads?
Not a solution here, but I have seen a similar behaviour to your second case with a smaller cluster.
Most of the time, the Podman socket's CPU usage is very high even when most of the services are idle, and the more services running on a node, the higher the CPU usage. If any CPU-intensive task runs, the socket stops responding and allocations start failing for a while. From a quick diagnostic, most of the log entries are health checks, so it might be that the Podman socket can't handle too many requests at the same time.
I haven't had the chance to debug further, but it might be a bug in Podman's API service. I don't use Docker so I don't know how it behaves, but I doubt this behaviour is normal there, since I have seen Docker machines running more containers.
@jdoss regarding problem 1: I think this is not directly related to this driver. By default, Nomad uses the bin-packing strategy to place tasks, so it will always try to fill up a node before it considers another one. An alternative is the so-called spread scheduler, which distributes work evenly. To illustrate, something along these lines (job name and spread attribute are just placeholders):
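# Per-job: spread allocations across distinct client nodes instead of
# letting bin-packing fill one node first.
job "example" {
  spread {
    attribute = "${node.unique.id}"
    weight    = 100
  }
  # group/task stanzas omitted
}

# Cluster-wide alternative: make "spread" the default scheduler algorithm
# in the server config (or switch it at runtime with
# `nomad operator scheduler set-config -scheduler-algorithm=spread`).
server {
  default_scheduler_config {
    scheduler_algorithm = "spread"
  }
}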
Problem 2: I am aware of this problem; it also happens in our environment. The root cause is unclear for now, and as a workaround we have a systemd timer that periodically checks and cleans up the socket.
@rina-spinne getting metrics/stats from a single container is somewhat expensive, and running many containers concurrently while polling stats at a frequent pace can quickly cause quite a lot of load. Maybe you can tune the collection_interval configuration option? It has an aggressive default of just 1 second. A good solution is to align it with your metrics collector's interval, which for a typical Prometheus setup means 30s or 60s. As a sketch, assuming the option referred to here is the collection_interval in the Nomad agent's telemetry stanza (check your agent and driver docs for the exact placement in your version):
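telemetry {
  # Assumed placement: the agent-level metrics collection interval, which
  # defaults to 1s. Aligning it with a 30s/60s Prometheus scrape interval
  # reduces how often container stats are pulled.
  collection_interval = "30s"
}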
@jdoss regarding problem 1: I think this is not directly related to this driver. By default, Nomad uses the bin-packing strategy to place tasks, so it will always try to fill up a node before it considers another one. An alternative is the so-called spread scheduler, which distributes work evenly.
Thanks for this tip. I will modify my jobs to see if I can spread things out to prevent the socket from getting overloaded and report back.
Problem 2: I am aware of this problem; it also happens in our environment. The root cause is unclear for now, and as a workaround we have a systemd timer that periodically checks and cleans up the socket.
Would you be able to share this unit and timer?
@towe75 @rina-spinne I opened https://github.com/containers/podman/issues/14941 to see if the Podman team has any thoughts on this issue. If you have any additional context to add to that issue, I am sure that would help track things down.
Maybe related https://github.com/hashicorp/nomad/issues/16246
@jdoss I do not think that it's related. I don't know enough about your environment to recommend something specific. A rule of thumb in our cluster is to keep the number of containers below 70 on a 2-core machine (e.g. AWS m5a.large). We found that the overhead for logging, scraping, process management, etc. gets rather high above 100 containers on such a node. But this depends, of course, on a lot of things and may well not hold for your workload.