nerdctl ps slows down and errors with 350+ containers
I'm running a load test on my Kubernetes cluster using ClusterLoader2, which just runs a bunch of pause containers on each node. With around 350 containers on a node, nerdctl ps is much slower than ctr, taking 20 to 34 seconds where ctr takes about 0.1 seconds:
$ time sudo /usr/local/bin/nerdctl --debug-full -n k8s.io ps
FATA[0020] container "4690dc5561d03a5c89453546a3c5d7c0a7ce7c3938cac1560cf358a2c6c040e9" in namespace "k8s.io": not found
real 0m20.374s
user 0m0.284s
sys 0m0.117s
$ time sudo /usr/local/bin/nerdctl --debug-full -n k8s.io container ls
FATA[0034] container "4a08c9bdba3dc064ad82fbd583f992d409dd8ee3346bfd413ea010cae1f43030" in namespace "k8s.io": not found
real 0m34.376s
user 0m0.284s
sys 0m0.132s
$ time sudo ctr -n k8s.io c ls | wc -l
354
real 0m0.126s
user 0m0.081s
sys 0m0.091s
Also notice that both nerdctl commands fail because a container was removed while the command was running (which becomes more likely the longer the command takes).
I think the race condition is caused by ps.go calling c.Spec on each container after fetching the list of containers, meaning that if a container is removed before we can inspect it, the command will error. Could be fixed by skipping the removed container rather than erroring if the error is "not found": https://github.com/containerd/nerdctl/blob/cee3b6a4840db6b5dd4019ef343af7bf4ba5c940/cmd/nerdctl/ps.go#L118-L122
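To make the suggested fix concrete, here is a minimal sketch of the "skip on not found" idea. It is not the actual ps.go code: `container`, `spec`, `collectSpecs` and `errNotFound` are hypothetical stand-ins, and in nerdctl itself the check would presumably go through containerd's `errdefs.IsNotFound` rather than a local sentinel error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical stand-in for containerd's "not found" error.
var errNotFound = errors.New("not found")

// container is a hypothetical, minimal view of a listed container.
type container struct {
	id      string
	removed bool // simulates deletion between listing and inspection
}

// spec simulates the per-container inspect call that can fail
// when the container has been removed in the meantime.
func (c container) spec() (string, error) {
	if c.removed {
		return "", fmt.Errorf("container %q: %w", c.id, errNotFound)
	}
	return "spec-of-" + c.id, nil
}

// collectSpecs inspects every container, skipping the ones that vanished
// after the list was fetched and failing only on other errors.
func collectSpecs(cs []container) ([]string, error) {
	var out []string
	for _, c := range cs {
		s, err := c.spec()
		if errors.Is(err, errNotFound) {
			continue // removed mid-listing: skip instead of aborting
		}
		if err != nil {
			return nil, err
		}
		out = append(out, s)
	}
	return out, nil
}

func main() {
	specs, err := collectSpecs([]container{{id: "a"}, {id: "b", removed: true}, {id: "c"}})
	fmt.Println(specs, err) // prints "[spec-of-a spec-of-c] <nil>"
}
```

Any error other than "not found" still aborts the listing, so real failures are not hidden.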
Not sure what to do about the performance issue though, if we have to make O(n) requests to fetch the Spec of each container. Maybe we could do some of those requests in parallel?
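For illustration, a bounded-concurrency version of the per-container fetch could look roughly like the sketch below. `fetchSpec` and `fetchAll` are hypothetical names, the 10 ms sleep only simulates a round trip, and the worker count of 16 is an arbitrary assumption, not a measured value.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchSpec is a hypothetical stand-in for one per-container round trip.
func fetchSpec(id string) string {
	time.Sleep(10 * time.Millisecond) // simulated latency
	return "spec-of-" + id
}

// fetchAll calls fetchSpec for every ID with at most `workers` calls in
// flight, and returns the results in input order. workers must be >= 1.
func fetchAll(ids []string, workers int) []string {
	out := make([]string, len(ids))
	sem := make(chan struct{}, workers) // counting semaphore
	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` calls are already running
		go func(i int, id string) {
			defer wg.Done()
			defer func() { <-sem }()
			out[i] = fetchSpec(id) // each goroutine writes its own slot
		}(i, id)
	}
	wg.Wait()
	return out
}

func main() {
	ids := make([]string, 350)
	for i := range ids {
		ids[i] = fmt.Sprintf("c%03d", i)
	}
	start := time.Now()
	specs := fetchAll(ids, 16)
	fmt.Println(len(specs), "specs in", time.Since(start))
}
```

Capping the number of in-flight calls keeps the command from flooding the daemon with hundreds of simultaneous requests, while still cutting wall-clock time compared to a sequential loop.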
Maybe we could do some of those requests in parallel?
SGTM.
We should also skip inspecting c.Spec when --quiet is set, since only the container IDs are printed in that mode.
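A rough sketch of the --quiet short-circuit, assuming the container ID is already available from the list call. `rows`, `fetchSpec` and `specCalls` are hypothetical names used only to show that no per-container call happens in quiet mode.

```go
package main

import "fmt"

// specCalls counts simulated inspect calls so the saving is visible.
var specCalls int

// fetchSpec is a hypothetical stand-in for the per-container inspect call.
func fetchSpec(id string) string {
	specCalls++
	return "image-for-" + id
}

// rows builds one output line per container. In quiet mode only the ID is
// emitted, so fetchSpec is never called.
func rows(ids []string, quiet bool) []string {
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		if quiet {
			out = append(out, id)
			continue
		}
		out = append(out, id+"\t"+fetchSpec(id))
	}
	return out
}

func main() {
	ids := []string{"a", "b", "c"}

	quietRows := rows(ids, true)
	fmt.Println(quietRows, specCalls) // prints "[a b c] 0"

	fullRows := rows(ids, false)
	fmt.Println(len(fullRows), specCalls) // prints "3 3"
}
```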
I looked a bit into this. In the Docker implementation, only one call to the daemon happens. Here: https://github.com/docker/cli/blob/3dad26ca2d418092b8c4e01b03d0455d583bec86/cli/command/container/list.go#L122
In the nerdctl implementation, we make O(n) calls to c.Spec to achieve the same. My question: is c.Spec a call to containerd? If it's not, it shouldn't slow this operation down. The only place I'm sure makes O(n) calls to containerd is
https://github.com/containerd/nerdctl/blob/e83e18b98e89c7f5948c5777ab3ca0068299e703/cmd/nerdctl/ps.go#L234-L235
But that only happens with --size or --format=wide so this can't be it.
I tried running ~200 nginx containers on my machine and nerdctl ps returns quickly (<2 seconds). I can't reproduce.