nomad-driver-containerd
The same image seems to be pulled in parallel causing disk exhaustion
We have about 100 parameterized job definitions that use the same image config:

config {
  image = "username/backend:some_tag"
}
The problem is that disk space is exhausted on Nomad clients, and it looks like the reason is that the image is being pulled individually for each job, despite every job specifying the exact same image with the same tag. With the docker Nomad driver this didn't happen: all jobs made use of a single image that was pulled and extracted once.
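For comparison, with the docker driver the sharing is easy to confirm, since a single locally stored image backs every container that uses it. A rough sketch of that check (image name as above):

# The image appears once no matter how many jobs reference it...
docker images username/backend

# ...while all the corresponding containers point at that one image.
docker ps --filter ancestor=username/backend:some_tag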
I might be wrong about the explanation, but this is what I gather from hundreds of error messages like:
[ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=62ab19a7-4e67-c941-cc39-340394800fa1 task=main error="rpc error: code = Unknown desc = Error in pulling image username/backend:some_tag: failed to prepare extraction snapshot \"extract-138110298-tmpn sha256:bf868a0e662ae83512efeacb6deb2e0f0f1694e693fab8f53c110cb503c00b99\": context deadline exceeded"
I.e., it looks like each allocation has its own extraction snapshot. Is it possible to configure the driver (or containerd) so that all jobs will share a single image snapshot?
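One way to check this directly is to ask containerd what it has stored. A rough sketch, assuming the driver pulls into a nomad containerd namespace (an assumption, hinted at by the nomad/ prefix in the error output further down this thread) and the default overlayfs snapshotter:

# Images in the namespace the driver uses; a shared content store should
# show the image only once.
ctr -n nomad images ls

# Snapshots in the same namespace; a separate extraction snapshot per
# allocation would show up as many entries over the same layers.
ctr -n nomad snapshots ls

# Approximate per-snapshot disk usage.
ctr -n nomad snapshots usage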
I noticed this was just implemented in the podman driver and it looks simple, so maybe it can be reused for containerd: https://github.com/hashicorp/nomad-driver-podman/commit/40db1ef0c5af9f2aff7829449af3d950b8ff59b9?diff=unified
@aartur I am not able to reproduce this. I tried to launch 10 jobs with the same image, golang:latest (image size is ~1 GB).

Before I launched the jobs (~55 GB of disk space available):
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 967M 0 967M 0% /dev
tmpfs 200M 6.5M 193M 4% /run
/dev/mapper/vagrant--vg-root 62G 4.4G 55G 8% /
$ nomad job status
root@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd/example# nomad status
ID Type Priority Status Submit Date
golang service 50 running 2022-08-01T17:57:07Z
golang-1 service 50 running 2022-08-01T17:57:45Z
golang-2 service 50 running 2022-08-01T17:58:06Z
golang-3 service 50 running 2022-08-01T17:58:51Z
golang-4 service 50 running 2022-08-01T17:59:03Z
golang-5 service 50 running 2022-08-01T17:59:09Z
golang-6 service 50 running 2022-08-01T17:59:22Z
golang-7 service 50 pending 2022-08-01T17:59:29Z
golang-8 service 50 pending 2022-08-01T17:59:34Z
golang-9 service 50 pending 2022-08-01T17:59:39Z
NOTE: the pending jobs are pending because memory is exhausted on my VM and Nomad is not able to place those allocations.
After the jobs are running, available disk space is still ~55 GB:
vagrant@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 967M 0 967M 0% /dev
tmpfs 200M 6.5M 193M 4% /run
/dev/mapper/vagrant--vg-root 62G 4.4G 55G 8% /
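To see where the bytes actually go (or don't), the snapshotter and content store directories can also be checked directly. A sketch, assuming the default containerd root and the overlayfs snapshotter:

# Extracted snapshots (one writable working set per container).
sudo du -sh /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs

# Shared content store holding the pulled layer blobs.
sudo du -sh /var/lib/containerd/io.containerd.content.v1.content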
Also, I checked using nerdctl and I only see one image:
root@vagrant:~/go/src/github.com/Roblox/nomad-driver-containerd/example# nerdctl images
REPOSITORY TAG IMAGE ID CREATED SIZE
golang latest 19dde56d2309 30 minutes ago 1000.0 MiB
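For completeness, the same check can be repeated against the namespace the driver pulls into; nerdctl defaults to the default namespace, and the error output later in this thread shows refs prefixed with nomad/, so the namespace below is an assumption taken from those logs:

# List images in the (assumed) nomad namespace instead of the default one.
nerdctl --namespace nomad images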
I'm able to reproduce it by submitting 100 jobs with the following bash script:
#!/bin/bash
# Generate and submit 100 copies of the same job, all using the archlinux image.
for i in $(seq 1 100); do
  cat << EOT > job.nomad
job "bash_loop_$i" {
  datacenters = ["mydatacenter"]
  type        = "service"

  group "main" {
    task "main" {
      driver = "containerd-driver"

      config {
        image   = "archlinux"
        command = "/bin/bash"
        args    = ["-c", "while [ 1 ]; do sleep 1; done"]
      }

      resources {
        cpu    = 100
        memory = 30
      }
    }
  }
}
EOT
  echo "Running job $i"
  nomad job run -detach job.nomad
done
(mydatacenter needs to be adjusted.) When I observe disk space (by running watch -n1 'df -m /'), I see disk usage increase by about 25 GB. I also see error messages in the logs, e.g.:
containerd[1150]: time="2022-08-08T18:22:32.114740523+02:00" level=error msg="(*service).Write failed" error="rpc error: code = Unavailable desc = ref nomad/1/layer-sha256:e1deda52ffad5c9c8e3b7151625b679af50d6459630f4bf0fbf49e161dba4e88 locked for 15.811395992s (since 2022-08-08 18:22:15.868809674 +0200 CEST m=+478296.835995506): unavailable" expected="sha256:e1deda52ffad5c9c8e3b7151625b679af50d6459630f4bf0fbf49e161dba4e88" ref="layer-sha256:e1deda52ffad5c9c8e3b7151625b679af50d6459630f4bf0fbf49e161dba4e88" total=58926165
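The "locked for 15.8s" part suggests many allocations trying to write the same layer ref at the same time. One way to watch for that while the script above runs, sketched under the assumption that the driver uses the nomad containerd namespace:

# Active content ingests in containerd; many concurrent writers for the
# same layer-sha256:... ref would line up with the "locked" errors above.
watch -n1 'ctr -n nomad content active'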