AKS sysbox nodes sporadically enter bad state (`stat failed on /var/lib/containers/storage/overlay/...`)

Open top-oai opened this issue 1 year ago • 16 comments

I'm attempting to port our CI kubernetes cluster to use sysbox to enable building and running docker containers inside our CI jobs.

We're finding that after a number of CI builds, nodes sporadically get into a bad state and are unable to start new containers (pods with status CreateContainerError). For pods/containers that are already running, their local docker daemon gets stuck and the CI build jobs running on them (mostly just docker builds) time out.

We are running roughly 14 pods per node, and we restart our pods (including their local dockerd process) after every job. The cluster is hosted by AKS on Azure. My sample size isn't enormous, but this issue seems to reproduce more frequently on sysbox 0.6.3 (after ~30 builds per node) vs 0.6.2 with SYSBOX_ALLOW_TRUSTED_XATTR set to FALSE (after ~120 builds per node), however it does reproduce on both. The node's kernel version is 5.15.

kubectl describe node doesn't show anything out of the ordinary: the node is Ready and hosting pods (though they cannot start), and it doesn't have any errors flagged on it.

journalctl -u kubelet does contain error log lines that healthy nodes do not contain, namely many lines that look like:

W0129 09:05:08.825295   17183 container.go:589] Failed to update stats for container "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod753ba6ae_0b8f_47c8_992c_0f9eac91476a.slice/crio-761d1ec685b47ab5c8df26d44bb00c748610eb39c834c7d43383f749ff841835.scope": unable to determine device info for dir: /var/lib/containers/storage/overlay/2e13b53c75be18eb8c3f3773de0c861efe4135b1298157f5d981b3fea38c25ce/diff: stat failed on /var/lib/containers/storage/overlay/2e13b53c75be18eb8c3f3773de0c861efe4135b1298157f5d981b3fea38c25ce/diff with error: no such file or directory, continuing to push stats

journalctl -u crio also contains error log lines that healthy nodes do not have, but they seem to just be a consequence of the containers not being able to start:

time="2024-01-29 05:09:24.451212422Z" level=warning msg="Unable to delete container b01fadc160a9a10c4da6ac84c1d6e6082a30119de55eec23a3b9e855d71f6e04: `/usr/bin/sysbox-runc --root /run/runc --systemd-cgroup delete --force b01fadc160a9a10c4da6ac84c1d6e6082a30119de55eec23a3b9e855d71f6e04` failed: time=\"2024-01-29T05:09:24Z\" level=error msg=\"container init still running\"\n : exit status 1" id=3d1dc20d-694b-46b3-a7e4-4f38ffa6939c name=/runtime.v1.RuntimeService/CreateContainer

crictl stop $STUCK_CONTAINER_ID gives the following message:

E0129 09:10:39.944756  458151 remote_runtime.go:505] "StopContainer from runtime service failed" err="rpc error: code = DeadlineExceeded desc = failed to stop container 7f5d0d7dbffa70f5a34a07a62ed25975a1910596819e29803c2b75c413af48fc: context deadline exceeded" containerID="7f5d0d7dbffa7"
FATA[0002] stopping the container "7f5d0d7dbffa7": rpc error: code = DeadlineExceeded desc = failed to stop container 7f5d0d7dbffa70f5a34a07a62ed25975a1910596819e29803c2b75c413af48fc: context deadline exceeded

I don't see anything out of the ordinary in journalctl -u sysbox-mgr or journalctl -u sysbox-fs on the bad node.

Are there more places that I can find potentially helpful information here?

top-oai avatar Jan 29 '24 18:01 top-oai

Hi @top-oai,

Thanks for giving Sysbox a shot, hope you find it helpful (after resolving the issues you are having).

A few questions to narrow the problem space:

  1. Are regular pods (i.e., non sysbox) working fine in the cluster?

  2. If you launch a regular privileged pod with Docker inside (without sysbox), does that work fine?

  3. Are the K8s nodes healthy (i.e., disk space & memory)?

For pods/containers that are already running, their local docker daemon gets stuck and the CI build jobs running on them (mostly just docker builds) time out.

Can you exec into one of those sysbox pods and try accessing /proc/uptime and any directory under /proc/sys? Those are emulated by sysbox-fs inside the pod, so I want to see if sysbox-fs is responding or if it got stuck somehow.

Also, inside the pod, can you create a dir such as /root/testdir and mount tmpfs on it (mount -t tmpfs tmpfs /root/testdir)? Since sysbox intercepts that mount syscall, I want to see if that's working OK too.
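
Concretely, something along these lines should do it (the pod name is just a placeholder):

kubectl exec <sysbox-pod> -- cat /proc/uptime        # emulated by sysbox-fs
kubectl exec <sysbox-pod> -- ls /proc/sys/kernel     # also emulated by sysbox-fs

# Check that the mount syscall interception works too:
kubectl exec <sysbox-pod> -- sh -c "mkdir -p /root/testdir && mount -t tmpfs tmpfs /root/testdir && grep testdir /proc/mounts"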

My sample size isn't enormous, but this issue seems to reproduce more frequently on sysbox 0.6.3 (after ~30 builds per node) vs 0.6.2 with SYSBOX_ALLOW_TRUSTED_XATTR set to FALSE (after ~120 builds per node), however it does reproduce on both.

Interesting, that could be a clue; will keep that in mind as we debug.

The node's kernel version is 5.15.

Can you check if shiftfs is installed on the node? (lsmod | grep shiftfs)

Thanks!

ctalledo avatar Jan 30 '24 04:01 ctalledo

Hi @ctalledo , I really appreciate the thoughtful response.

Thanks for giving Sysbox a shot, hope you find it helpful (after resolving the issues you are having).

Me too! :)

A few questions to narrow the problem space:

Are regular pods (i.e., non sysbox) working fine in the cluster?

Yes they are fine. And to be clear, other sysbox pods on other nodes continue to run fine also (until they eventually hit the same issue).

If you launch a regular privileged pod with Docker inside (without sysbox), does that work fine?

I'm not sure. It takes quite a few runs and some time to build confidence either way, but it's probably worth the sanity check, so let me try it out and get back to you.

Are the K8s nodes healthy (i.e., disk space & memory)?

Yes, all nodes in the cluster are healthy. Even the ones in the bad state do not indicate any disk pressure or memory pressure, and I've verified with htop/df.

Can you exec into one of those sysbox pods and try accessing /proc/uptime and any directory under /proc/sys? Those are emulated by sysbox-fs inside the pod, so I want to see if sysbox-fs is responding or if it got stuck somehow.

Also, inside the pod, can you create a dir such as /root/testdir and mount tmpfs on it (mount -t tmpfs tmpfs /root/testdir)? Since sysbox intercepts that mount syscall, I want to see if that's working OK too.

Oh, hmm, I tried to exec in and I see the following message:

FATA[0000] join_namespaces:591 nsenter: failed to open /proc/1090344/ns/ipc: No such file or directory 
FATA[0000] nsexec:809 nsenter: failed to sync with child: next state: Invalid argument 
ERRO[0000] exec failed: container_linux.go:427: starting container process caused: process_linux.go:119: executing setns process caused: exit status 1 
command terminated with exit code 1

and in sysbox-fs logs I see:

2024/01/30 05:27:22 FUSE: -> [ID=0xbe] Getattr valid=2562047h47m16.854775807s ino=5837941533507063541 size=0 mode=dr-xr-xr-x uid=231072 gid=231072
2024/01/30 05:27:22 FUSE: <- Lookup [ID=0xc0 Node=0xb Uid=231072 Gid=231072 Pid=1373073] "kernel"
time="2024-01-30 05:27:22" level=debug msg="Requested Lookup() operation for entry kernel (req ID=0xc0)"
time="2024-01-30 05:27:22" level=debug msg="Executing Lookup() for req-id: 0xc0, handler: ProcSysKernel, resource: kernel"
time="2024-01-30 05:27:22" level=debug msg="Executing Lookup() for req-id: 0xc0, handler: PassThrough, resource: kernel"
time="2024-01-30 05:27:22" level=debug msg="Executing nsenterEvent's SendRequest() method"

I can't remember now if I was ever able to exec into these stuck pods (even though their k8s status is "Running").

My sample size isn't enormous, but this issue seems to reproduce more frequently on sysbox 0.6.3 (after ~30 builds per node) vs 0.6.2 with SYSBOX_ALLOW_TRUSTED_XATTR set to FALSE (after ~120 builds per node), however it does reproduce on both.

Interesting, that could be a clue; will keep that in mind as we debug.

I think my sample size really wasn't big enough to conclude anything there; I switched back to 0.6.3 and needed quite a few attempts to reproduce the issue, just like I did with 0.6.2. So I don't think this is a clue after all.

Can you check if shiftfs is installed on the node? (lsmod | grep shiftfs)

Yes shiftfs is installed.

Today I managed to get the problem to reproduce while --log-level debug was set. I see a few interesting things:

  1. I see that sysbox-fs logs repeatedly show the lines:

    Requested Lookup() operation for entry kernel (req ID=0xc0)
    Executing Lookup() for req-id: 0xc0, handler: ProcSysKernel, resource: kernel
    Executing Lookup() for req-id: 0xc0, handler: PassThrough, resource: kernel
    Executing nsenterEvent's SendRequest() method
    Container pre-registration started: id = 5702dfa6e839
    Container 5702dfa6e839 will share sysbox-fs state with [0xc000420500 0xc000420a00]
    Created fuse server for container 5702dfa6e839f30a576cc15a694ef0297b2b945483d2ff811fc442790ea742ac
    Fuse server for container 5702dfa6e839f30a576cc15a694ef0297b2b945483d2ff811fc442790ea742ac shares state with container   92071835791ba743f2890f7740b6d6e24f9e7d6fabfdf4bc949d780857c77d2a
    Container pre-registration completed: id = 5702dfa6e839
    

    In our setup, pods are frequently restarting as part of their normal operation. I'm not sure if the shares state with container line is a red herring or not, but it appears as though a new container is starting up before the previous container has shut down?

    Note that these shares state with container appear throughout the logs, both before and after the nodes have entered these bad states.

    Note also that the logs show the container 92071835791ba743f2890f7740b6d6e24f9e7d6fabfdf4bc949d780857c77d2a gets shut down 3 minutes after the shares state with container log.

  2. I now recall that at one point I had tried mounting /var/lib/docker to a Kubernetes EmptyDir, and noticed the mount source at <some/host/dir> should be mounted in one container only, but is already mounted in containers <ids> error appearing in the logs; again it looked as if new containers were being created before the old containers were removed. But I'd expect the EmptyDir mount to only ever be mounted into one container at a time. I stopped mounting /var/lib/docker to an EmptyDir because I realized that sysbox already takes care of that mount for me, but perhaps this overlapping-containers issue is the problem?

    Note also that my Kubernetes cluster version is 1.26.10.

  3. In the sysbox-mgr logs, for every container creation (both before/after becoming unhealthy) I see:

    "containerdVolMgr: created volume for container 5702dfa6e839"
    "dockerVolMgr: created volume for container 5702dfa6e839"
    "autoRemoveCheck: Docker query start for 5702dfa6e839"
    "FIXME: Got an status-code for which error does not match any expected type!!!: -1" module=api status_code=-1
    "autoRemoveCheck: Docker connection failed for 5702dfa6e839: failed to retrieve Docker info%!(EXTRA errdefs.errUnknown=Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?)"
    

    I'm a bit puzzled by the logged Docker errors, since the Kubernetes cluster is using containerd. Docker daemons are indeed running within each sysbox pod, though; perhaps this is referring to those?

top-oai avatar Jan 30 '24 05:01 top-oai

If you launch a regular privileged pod with Docker inside (without sysbox), does that work fine?

I'm not sure. It takes quite a few runs and some time to build confidence either way, but it's probably worth the sanity check, so let me try it out and get back to you.

Okay, I ran a whole bunch of tests in privileged: true pods. It was easiest to do this by just replacing my sysbox deployment with a "regular" deployment. I ran these pods on the same nodes that the sysbox pods were running on. The "regular" pods were performing the same docker builds as the sysbox pods, but they did not appear to have the same issue: no nodes entered a failed state. Not only that, but the regular pods could be scheduled and run successfully on the "bad nodes" left over from the earlier sysbox runs. Additionally, after I ran this test, I re-deployed the sysbox pods and they went back to being unable to start up on the "bad nodes".

top-oai avatar Jan 30 '24 07:01 top-oai

Thanks for all the info @top-oai.

I think the fact that the kubectl exec into the sysbox pod failed is problematic and related to the failure you are seeing, so this is what we should focus on.

I need to dig into the code to understand how the exec can cause the errors you saw:

FATA[0000] join_namespaces:591 nsenter: failed to open /proc/1090344/ns/ipc: No such file or directory 
FATA[0000] nsexec:809 nsenter: failed to sync with child: next state: Invalid argument 
ERRO[0000] exec failed: container_linux.go:427: starting container process caused: process_linux.go:119: executing setns process caused: exit status 1 
command terminated with exit code 1

I don't see anything suspicious in the sysbox-fs or sysbox-mgr logs.

  • Container 5702dfa6e839 will share sysbox-fs state with [0xc000420500 0xc000420a00]: sysbox-fs emulates /proc/sys state per pod, so this simply means that the pod has two containers sharing that state (i.e., the pause container and the main container).

  • autoRemoveCheck: Docker query start for 5702dfa6e839: sysbox-mgr checks if the higher level runtime is Docker by querying the Docker socket; if it's not there, then it will move on. In this case the higher level runtime is K8s + CRI-O, so there's no Docker and that should be fine. The "FIXME: Got an status-code for which error does not match any expected type" log is coming from the Docker API client library used by sysbox-mgr to query Docker (it's harmless).

  • The mount source at <some/host/dir> should be mounted in one container only, but is already mounted in containers <ids> error occurs when Sysbox detects that the same host-dir is mounted into two containers at /var/lib/docker. It issues this warning because it knows that two Docker engines can't share the same /var/lib/docker (they'll collide on it). Not sure why you saw it with empty-dir, but let's park this since you've removed the mount (given that Sysbox will implicitly mount a host-dir into /var/lib/docker to avoid the overlay-on-overlay problem that occurs if it didn't).

Let me dig into the Sysbox code to see why kubectl exec would fail that way.

In the meantime, could you share your pod spec?

ctalledo avatar Jan 31 '24 04:01 ctalledo

Thanks for reading through and providing background for everything.

Let me dig into the Sysbox code to see why kubectl exec would fail that way.

Sounds good, thank you!

In the meantime, could you share your pod spec?

Sure, you can find it here.

top-oai avatar Jan 31 '24 07:01 top-oai

Hi @top-oai,

Question: when the sysbox pods enter a bad state, what happens if you launch a new Sysbox pod with an alpine image? Are you able to exec into that sysbox pod? Or does it also enter a bad state?

ctalledo avatar Feb 01 '24 02:02 ctalledo

Okay, I've applied the following pod:

apiVersion: v1
kind: Pod
metadata:
  name: alpine
  namespace: default
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  runtimeClassName: sysbox-runc
  containers:
  - image: alpine:latest
    command:
      - /bin/sh
      - "-c"
      - "echo hi && sleep 60m"
    imagePullPolicy: IfNotPresent
    name: alpine
  restartPolicy: Always
  nodeSelector:
    kubernetes.io/hostname: aks-sysboxcpu1-xbadx-xnodex
  tolerations:
  - key: sysbox-runc
    operator: Exists
    effect: NoSchedule

and I see the familiar results:

% kubectl describe pod alpine

...

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10s   default-scheduler  Successfully assigned alpine to aks-sysboxcpu1-20143122-vmss00000f
  Normal  Pulled     10s   kubelet            Container image "alpine:latest" already present on machine
  Warning  Failed     26s                  kubelet            Error: create container timeout

at this point I can get it to start by either deploying it to a good node, or by removing runtimeClassName: sysbox-runc from the spec.

I guess it's worth noting that the other images all have Ubuntu as their base.

Are you able to exec into that sysbox pod? Or does it also enter a bad state?

I am not able to exec into it because it can't get past the ContainerCreating state (and eventually lands in the CreateContainerError state after timing out). This is the same state as the pods from above enter into.

Let me see if I can get into this bad state while the pod is already successfully deployed.

top-oai avatar Feb 01 '24 16:02 top-oai

Thanks for trying that.

I am not able to exec into it because it can't get past the ContainerCreating state (and eventually lands in the CreateContainerError state after timing out). This is the same state as the pods from above enter into.

The fact that you can't create an alpine pod means Sysbox is in bad shape then.

Any suspicious logs from sysbox-mgr, sysbox-fs, or CRI-O at the time the alpine pod is created?

When you launch a pod, the way it works is:

  1. K8s talks to the kubelet
  2. The kubelet talks to CRI-O
  3. CRI-O tells sysbox-runc to create the container
  4. sysbox-runc sets up the container (namespaces, etc.)
  5. sysbox-runc "registers" the container with sysbox-mgr and sysbox-fs
  6. sysbox-runc exits and reports status to CRI-O (which in turn reports it to kubelet).
  7. The container starts running; sysbox-fs monitors activities in the container (accesses to /proc and /sys, traps syscalls, etc.)

The pod creation must be failing in step (3), (4) or (5) above, so look for CRI-O, sysbox-mgr and sysbox-fs logs at the time the alpine pod fails to be created.
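
For example, something like this on the bad node should capture the relevant window (adjust the time range to roughly when the alpine pod's CreateContainer attempt fails):

# Pull logs from the three components around the failure window:
journalctl -u crio       --since "10 minutes ago" > crio.log
journalctl -u sysbox-mgr --since "10 minutes ago" > sysbox-mgr.log
journalctl -u sysbox-fs  --since "10 minutes ago" > sysbox-fs.log

# And the pod-side view of the failure:
kubectl describe pod alpine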

ctalledo avatar Feb 01 '24 17:02 ctalledo

Let me see if I can get into this bad state while the pod is already successfully deployed.

Okay, I managed to start an alpine pod on a node, and then later have that node become bad. Once again I can verify that when I try to exec into my regular sysbox pods I see the same thing I posted above (join_namespaces:591 nsenter: failed ...); however, I am able to exec into the alpine pod without issue. Here's the output of a couple commands run on that pod: mount, df.

Perhaps the fact that the other regular sysbox pods are running dockerd (and this new alpine pod is not) is why they get stuck?

Any suspicious logs from sysbox-mgr, sysbox-fs, or CRI-O at the time the alpine pod is created?

I'm not seeing much, unfortunately; it's not much different from what I've posted above (no smoking guns as far as I can see):

  • journalctl -u crio logs for the creation and subsequent container creation timeout of the alpine pod.
  • journalctl -u sysbox-mgr just has time="2024-02-01 16:55:45" level=info msg="registered new container 6e262f5d50d2", and I can't find a corresponding "unregistered". In fact it looks like I don't see any "unregistered" logs after the node has entered a bad state. Note that log-level is back to the default of info now, not debug like above.
  • journalctl -u sysbox-fs is similarly not very interesting, logging only the container creation:
    time="2024-02-01 16:55:45" level=info msg="Container pre-registration completed: id = 6e262f5d50d2"
    time="2024-02-01 16:55:45" level=info msg="Container registration completed: id = 6e262f5d50d2, initPid = 2018155, uid:gid = 624288:624288"
    

So, since registration is being logged as completed, it seems that the issue is arising somewhere around step (5), or maybe between (5) and (6)? As far as I can tell from the pod status (e.g. it never gets past ContainerCreating) and logs, it looks like the container never actually starts running.

Also, maybe as an aside (I'm not sure if it's benign or not), throughout the sysbox-fs logs I see a lot of:

time="2024-02-01 09:58:56" level=error msg="No namespace file found /proc/48929/ns/user"

as well as a few:

time="2024-02-01 09:56:21" level=info msg="Unexpected error during NotifReceive() execution (no such file or directory) on fd 59, pid 40624, cntr 99578746acba"

and also a few:

time="2024-02-01 16:49:32" level=warning msg="Sysbox-fs first child process error status: exit status 1, pid: 1965267"

top-oai avatar Feb 01 '24 23:02 top-oai

Aha, okay, I think I found the cause.

These pods are running CI jobs, and a few of them were failing (consistently). I had originally planned to dive deeper into that problem after this was sorted, but I decided to take another look at them.

It turned out that they were hitting the pod's memory limit and getting oom killed. The node had plenty of memory this whole time; it was just the pods that were hitting their limit. After resolving that, I've run enough consecutive tests without nodes entering a bad state that I'm quite confident this was the trigger.

The pods were compiling C++ code inside a docker build step when they OOMed. This obviously caused the pod to fail, but it seems as though there was a chance that it would also leave the node in a bad state, causing all the issues from above.

The OOM didn't really show up in any of the logs I was looking at, but I didn't think to look at the node's kernel logs, which is where the pod OOM was reported.

Thanks again for working through this with me, I really appreciate that.

Everything's working great now and I'm unblocked, but it's still a mystery to me how an OOM in the pod might corrupt the node (or at least sysbox-specific state, based on the fact that regular runc containers can still work as usual). Let me know if you'd like me to run other queries on that. The implication here is that adding a memory limit on a pod is worse than not adding one, since OOM in either case has a chance to take out a node, but at least without a memory limit this is less likely to trigger.

top-oai avatar Feb 02 '24 08:02 top-oai

Hi @top-oai,

Thanks for the detailed explanation, and very happy to hear you found the issue.

It turned out that they were hitting the pod's memory limit and getting oom killed.

I see; what must be happening then is that the container's init process is getting killed by the kernel's OOM killer. That likely explains the failure to exec into the pod (e.g., join_namespaces:591 nsenter: failed to open /proc/1090344/ns/ipc: No such file or directory) because it's trying to enter into the namespace of a process that is dead. And this error log from sysbox-fs points to the same: time="2024-02-01 09:58:56" level=error msg="No namespace file found /proc/48929/ns/user". And finally, this also explains why sysbox-mgr reports the container is registered, but never unregistered.

BTW, the kernel log in the K8s node should show the OOM kill events IIRC.

Having said that, when the pod's processes are killed by the OOM killer, you would expect that the pod as a whole would be terminated. So there's a bug somewhere, but I don't know if it's at the level of the kubelet, CRI-O, or Sysbox. I would need to reproduce and dig deep to see what's going on.

The implication here is that adding a memory limit on a pod is worse than not adding one, since OOM in either case has a chance to take out a node

I thought it's not taking out the full node though, correct? Because you are able to launch a new pod (e.g., the alpine pod experiment we did above) and that works fine. Nonetheless I understand that the OOM kill is creating instability in the node (by killing some pods), so it's best not to use it until we get to the bottom of why.

ctalledo avatar Feb 03 '24 15:02 ctalledo

It turned out that they were hitting the pod's memory limit and getting oom killed.

I see; what must be happening then is that the container's init process is getting killed by the kernel's OOM killer.

Well, the OOM killer is certainly active, but it's not targeting the container's init process. It's instead targeting cc1plus processes: C++ compiler processes spawned within a docker build, run by the docker daemon, all inside the pod. These processes are both 1) the processes consuming all the memory and 2) the processes that the OOM killer is targeting. I see this by looking at the kernel logs (e.g. journalctl -k) on the node. The OOM killer does not appear to be targeting the container's init process, and I do not see this being reported as an OOMKilled status by Kubernetes; the container persists.

In fact, if I manually echo "1000" > /proc/1/oom_score_adj during container startup, to encourage the kernel's OOM killer to kill the container's init process first, things seem to work better and I'm unable to reproduce the issue.
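
Roughly, that amounts to a startup wrapper along these lines (the final exec target is just a placeholder):

#!/bin/sh
# Startup wrapper: make the container's init process (PID 1 inside the container)
# the OOM killer's preferred victim, then hand off to the real entrypoint.
echo 1000 > /proc/1/oom_score_adj
exec <real-entrypoint>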

I think that if, instead of 1.26 which I am currently using, I were on Kubernetes 1.28 (which sets memory.oom.group to 1), I would automatically get this desired behavior instead of the OOM killer killing container sub-processes but leaving the container alive.
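
For reference, this can be checked on the node itself (the cgroup path below is just an example of the kubepods hierarchy):

# On a cgroup v2 node: 1 means the OOM killer kills the whole cgroup at once,
# 0 means it picks individual processes inside it.
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<uid>.slice/crio-<container-id>.scope/memory.oom.group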

Nonetheless, termination of random processes within a container still should not put the node into a bad state.

BTW, the kernel log in the K8s node should show the OOM kill events IIRC.

It does. In particular in the logs of at least one bad node the OOM killer only killed cc1plus subprocesses (spawned by a docker build, and executed by dockerd all within the pod's only container). It didn't for example kill any container init processes, or the docker daemon, or anything else. I actually see multiple OOM kills of these cc1plus processes in bad node kernel logs, but on "good nodes" I see zero. In these cases the docker build seems to get stuck and the CI job reports a timeout.

Having said that, when the pod's processes are killed by the OOM killer, you would expect that the pod as a whole would be terminated.

Hmm, that's certainly what I initially expected would happen, and it's what I want to have happen, but as far as I can tell from some recent reading, that's not necessarily what the kernel OOM killer will do, and it certainly doesn't look like it's what is happening according to the kernel logs described above. Instead it seems the OOM killer is killing a subprocess within the pod.

So there's a bug somewhere, but I don't know if it's at the level of the kubelet, CRI-O, or Sysbox. I would need to reproduce and dig deep to see what's going on.

Yeah, I'm also working on this, but I've been finding it tricky so far to produce a minimal repro.

The implication here is that adding a memory limit on a pod is worse than not adding one, since OOM in either case has a chance to take out a node

I thought it's not taking out the full node though, correct? Because you are able to launch a new pod (e.g., the alpine pod experiment we did above) and that works fine.

Not quite, to clarify:

  1. I'm not able to launch a sysbox alpine pod on a bad node; in this case I see the familiar issue of k8s not being able to create the container.
  2. I can start regular runc (non-sysbox) alpine pods on a bad node and they function normally. I'm also using kubectl debug to access system logs of bad nodes, which likewise creates new regular runc non-sysbox ubuntu pods.
  3. If I start a sysbox alpine pod on a good node, and the node becomes bad while the alpine pod is running, then I can still exec into it.

top-oai avatar Feb 04 '24 09:02 top-oai

It's a bit of an aside, but I do notice that if I start a pod that runs stress-ng --vm-bytes 200M --vm-keep -m 1 --oomable with a memory limit of < 200Mi, then Kubernetes reports the container status as OOMKilled when not running on sysbox, but Completed when running on sysbox. In both cases the exit code is 0, which is expected (the OOM killer kills a child process of stress-ng, at which point stress-ng exits 0); the difference is just that when not using sysbox, Kubernetes flags the status as OOMKilled, but not when sysbox is used.
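
The test pod would look roughly like this (the image is a placeholder for anything that ships stress-ng; for the non-sysbox run, drop runtimeClassName and the userns annotation):

apiVersion: v1
kind: Pod
metadata:
  name: oom-test
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  runtimeClassName: sysbox-runc
  restartPolicy: Never
  containers:
  - name: stress
    image: <image-with-stress-ng>   # placeholder
    command: ["stress-ng", "--vm-bytes", "200M", "--vm-keep", "-m", "1", "--oomable"]
    resources:
      limits:
        memory: 128Mi   # anything below the 200M the workload tries to allocate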

top-oai avatar Feb 04 '24 21:02 top-oai

Ahh, after scrounging through the kernel logs more, I'm now noticing something else that is unique to bad nodes (I don't see this output in good-node kernel logs), and it usually occurs after an OOM kill:

Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel: INFO: task runc:[2:INIT]:1987683 blocked for more than 120 seconds.
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:       Not tainted 5.15.0-1053-azure #61-Ubuntu
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel: task:runc:[2:INIT]   state:D stack:    0 pid:1987683 ppid:1987654 flags:0x00004006
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel: Call Trace:
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  <TASK>
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  __schedule+0x243/0x5f0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  schedule+0x69/0x110
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  request_wait_answer+0x138/0x210
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? wait_woken+0x70/0x70
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  fuse_simple_request+0x1b5/0x360
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  fuse_lookup_name+0x105/0x250
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  fuse_lookup+0x6d/0x1b0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  __lookup_slow+0x88/0x150
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  walk_component+0x145/0x1c0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  link_path_walk.part.0.constprop.0+0x23f/0x3a0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? path_init+0x2c0/0x3f0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  path_openat+0xb5/0x2b0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? list_lru_add+0xeb/0x120
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  do_filp_open+0xb2/0x160
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? __check_object_size+0x1d/0x30
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? alloc_fd+0xb9/0x180
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  do_sys_openat2+0x9f/0x160
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  __x64_sys_openat+0x55/0x90
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  do_syscall_64+0x5c/0xc0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? __rseq_handle_notify_resume+0x2d/0xc0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? exit_to_user_mode_prepare+0x49/0x100
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? syscall_exit_to_user_mode+0x27/0x40
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? do_syscall_64+0x69/0xc0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? do_syscall_64+0x69/0xc0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? do_syscall_64+0x69/0xc0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  ? do_syscall_64+0x69/0xc0
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  entry_SYSCALL_64_after_hwframe+0x62/0xcc
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel: RIP: 0033:0x40408e
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel: RSP: 002b:000000c00069e9f8 EFLAGS: 00000206 ORIG_RAX: 0000000000000101
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel: RAX: ffffffffffffffda RBX: ffffffffffffff9c RCX: 000000000040408e
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel: RDX: 0000000000080000 RSI: 000000c00013a840 RDI: ffffffffffffff9c
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel: RBP: 000000c00069ea38 R08: 0000000000000000 R09: 0000000000000000
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 000000c00013a840
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel: R13: 0000000000000000 R14: 000000c0000061a0 R15: ffffffffffffffff
Feb 05 02:19:23 aks-sysboxcpu1-20143122-vmss00001L kernel:  </TASK>

I don't have a lot of experience with FUSE, but this seems like useful information. It doesn't look like quite the same thing, but after some searching around it does look like it has a few things in common with https://github.com/moby/moby/issues/42883. It seems like, in the event of an OOM kill, sometimes something gets deadlocked around FUSE, and that's why things are hanging instead of crashing?

top-oai avatar Feb 05 '24 05:02 top-oai

Okay, one more update.

Unfortunately I don't have any new information to help debug the root cause of the issue; however, I am now working around it. Upgrading from Kubernetes 1.26 -> 1.28 activated the memory.oom.group cgroups v2 option, and because of this, when an OOM kill is triggered, all processes in the container are killed. As documented above, this seems to work around the issue, and I can have pods be OOM killed many times without the node entering a bad state.

Unfortunately though, since this memory.oom.group behavior is difficult to configure, I think that post-upgrade it will now be difficult to reproduce the issue even if I wanted to. I'm sorry that I wasn't able to distill a minimal repro case here, but it wasn't for lack of trying; even with the particular docker build step known to OOM isolated and repeatedly run in multiple nodes, it would not trigger the bug. It was perhaps a funny combination of events brought about by running our unique set of CI jobs that usually ran together, I suppose.

For what it's worth, I feel like my previous post, https://github.com/nestybox/sysbox/issues/771#issuecomment-1926267508, was the most promising clue so far about the root cause of this particular issue.

top-oai avatar Feb 07 '24 02:02 top-oai

Thanks @top-oai for the extra info.

It's so hard to tell though; this is the type of issue that I need a local repro of in order to debug.

For the kernel stack trace you provided, what's interesting is that it is on task runc:[2:INIT], which tells me it likely corresponds to the failed kubectl exec, because runc:[2:INIT] is the name of the process that enters the container during kubectl exec, right before it execs into the binary passed to kubectl exec (e.g., say you run kubectl exec <pod> -- sh, then the process is briefly named runc:[2:INIT] right before it execs into sh).

The fuse_* functions in that backtrace are likely accesses that are intercepted by sysbox-fs. That is, sysbox-fs is a FUSE-based filesystem, mounted on subdirs of the container's /proc filesystem (e.g., /proc/sys). When doing kubectl exec into a Sysbox-based pod, sysbox-runc creates a process that enters the container. When that process enters the container, it may sometimes access files under /proc/sys to set things up, and those accesses will now go to sysbox-fs.

Thus, I think that backtrace is pointing to a problem in sysbox-fs, but not sure what problem it is though.
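
If it reproduces again, it might help to confirm on the node which kernel call the stuck task is blocked in (the PID is a placeholder):

# List tasks in uninterruptible sleep (state D) and the kernel function they're waiting in:
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'

# Kernel-side call chain for a specific stuck task (e.g., the runc:[2:INIT] pid); needs root:
cat /proc/<pid>/stack

# Or, if sysrq is enabled, dump all blocked tasks to the kernel log at once:
echo w > /proc/sysrq-trigger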

Hope that helps.

ctalledo avatar Feb 15 '24 05:02 ctalledo