Performance and scaling of nomad-driver-podman under high allocation loads
I have been using the Podman driver for my workloads, and my current project involves launching 16,000 containers across a very large Nomad cluster. I have run into some performance and scaling issues that have impacted deploying these workloads, and I am wondering if there are any specific steps I could take to improve the stability of my deployments and optimize the number of containers per client node. Here are the two major issues I am seeing when launching jobs in batches of 4,000:
- The deployment piles allocations onto a small number of client nodes: some nodes end up with one or two containers running while others have 50+. This seems to cause the second issue.
- The Podman socket, I assume, gets overloaded. Under high allocation load the Podman driver shows as unavailable in the Web UI and allocations start failing.
The failed allocations tend to snowball a client node into an unusable state because the Podman socket never fully recovers to accept new allocations, which leads to a large number of failed allocations.
Does anyone have any recommendations for changing my jobs so they spread out more evenly across my client nodes? I think I need more time between container starts. I am using these settings in my job:
update {
  stagger           = "30s"
  max_parallel      = 1
  min_healthy_time  = "15s"
  progress_deadline = "30m"
}

restart {
  attempts = 10
  interval = "30m"
  delay    = "2m"
  mode     = "fail"
}

scaling {
  enabled = true
  min     = 0
  max     = 20000
}
Also, any thoughts on why the Podman socket gets overwhelmed by the driver? My client nodes run Fedora CoreOS, which has pretty decent sysctl settings out of the box, and I am also applying the Nomad-recommended settings:
- path: /etc/sysctl.d/30-nomad-bridge-iptables.conf
  contents:
    inline: |
      net.bridge.bridge-nf-call-arptables=1
      net.bridge.bridge-nf-call-ip6tables=1
      net.bridge.bridge-nf-call-iptables=1
- path: /etc/sysctl.d/31-nomad-dynamic-ports.conf
  contents:
    inline: |
      net.ipv4.ip_local_port_range=49152 65535
- path: /etc/sysctl.d/32-nomad-max-user.conf
  contents:
    inline: |
      fs.inotify.max_user_instances=16384
      fs.inotify.max_user_watches=1048576
- path: /etc/sysctl.d/33-nomad-nf-conntrack-max.conf
  contents:
    inline: |
      net.netfilter.nf_conntrack_max = 524288
$ cat /proc/sys/fs/file-max
9223372036854775807
Does anyone else use the Podman driver for high allocation workloads?
Not a solution here, but I have seen a similar behaviour to your second case with a smaller cluster.
Most of the time, the Podman socket's CPU usage is very high even when most of the services are idle, and the more services running on a node, the higher the CPU usage. If any CPU-intensive task runs, the socket stops responding and allocations start failing for a while. From a quick diagnostic, most of the log entries are health checks, so it might be that the Podman socket can't handle too many requests at the same time.
I haven't had the chance to debug further, but it might be a bug in Podman's API service. I don't use Docker so I don't know how it behaves, but I doubt this behaviour is normal there, since I have seen Docker machines running more containers.
@jdoss regarding problem 1: I think this is not directly related to this driver. By default, Nomad uses the bin-packing strategy to place tasks, so it will always try to fill up a node before it considers another one. An alternative is the so-called spread scheduler, which distributes work evenly. To illustrate, something along these lines (job name and spread attribute are just placeholders):
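# Per-job: spread allocations across distinct client nodes instead of
# letting bin-packing fill one node first.
job "example" {
  spread {
    attribute = "${node.unique.id}"
    weight    = 100
  }
  # group/task stanzas omitted
}

# Cluster-wide alternative: make "spread" the default scheduler algorithm
# in the server config (or switch it at runtime with
# `nomad operator scheduler set-config -scheduler-algorithm=spread`).
server {
  default_scheduler_config {
    scheduler_algorithm = "spread"
  }
}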
Problem 2: I am aware of this problem; it also happens in our environment. The root cause is unclear for now, and as a workaround we have a systemd timer that periodically checks and cleans up the socket.
@rina-spinne getting metrics/stats from a single container is somewhat expensive, and running many containers concurrently while polling stats at a frequent pace can quickly cause quite a lot of load. Maybe you can tune the collection_interval configuration option? It has an aggressive default of just 1 second. A good solution is to align it with your metrics collector's interval, which for a typical Prometheus setup means 30s or 60s. As a sketch, assuming the option referred to here is the collection_interval in the Nomad agent's telemetry stanza (check your agent and driver docs for the exact placement in your version):
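telemetry {
  # Assumed placement: the agent-level metrics collection interval, which
  # defaults to 1s. Aligning it with a 30s/60s Prometheus scrape interval
  # reduces how often container stats are pulled.
  collection_interval = "30s"
}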
@jdoss regarding problem 1: I think this is not directly related to this driver. By default, Nomad uses the bin-packing strategy to place tasks, so it will always try to fill up a node before it considers another one. An alternative is the so-called spread scheduler, which distributes work evenly.
Thanks for this tip. I will modify my jobs to see if I can spread things out to prevent the socket from getting overloaded and report back.
Problem 2: I am aware of this problem; it also happens in our environment. The root cause is unclear for now, and as a workaround we have a systemd timer that periodically checks and cleans up the socket.
Would you be able to share this unit and timer?
@towe75 @rina-spinne I opened https://github.com/containers/podman/issues/14941 to see if the Podman team has any thoughts on this issue. If you have any additional context to add to that issue, I am sure that would help track things down.
Maybe related https://github.com/hashicorp/nomad/issues/16246
@jdoss I do not think that it's related. I don't know enough about your environment to recommend something specific. A rule of thumb in our cluster is to keep the number of containers below 70 on a 2-core machine (e.g. AWS m5a.large). We found that the overhead for logging, scraping, process management, etc. gets rather high above 100 containers on such a node. But this depends, of course, on a lot of things and may well not hold for your workload.