Andrew Top comments

17 comments by Andrew Top

AKS sysbox nodes sporadically enter bad state (`stat failed on /var/lib/containers/storage/overlay/...`)

Hi @ctalledo, I really appreciate the thoughtful response.

> Thanks for giving Sysbox a shot, hope you find it helpful (after resolving the issues you are having).

Me too!...

AKS sysbox nodes sporadically enter bad state (`stat failed on /var/lib/containers/storage/overlay/...`)

> > If you launch a regular privileged pod with Docker inside (without sysbox), does that work fine? > I'm not sure. It takes quite a few runs and time...

AKS sysbox nodes sporadically enter bad state (`stat failed on /var/lib/containers/storage/overlay/...`)

Thanks for reading through and providing background for everything.

> Let me dig into the Sysbox code to see why `kubectl exec` would fail that way.

Sounds good, thank you!...

AKS sysbox nodes sporadically enter bad state (`stat failed on /var/lib/containers/storage/overlay/...`)

Okay, I've applied the following pod:

```
apiVersion: v1
kind: Pod
metadata:
  name: alpine
  namespace: default
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  runtimeClassName: sysbox-runc
  containers:
  - image: alpine:latest
    command:
    - /bin/sh
    -...
```

AKS sysbox nodes sporadically enter bad state (`stat failed on /var/lib/containers/storage/overlay/...`)

> Let me see if I can get into this bad state while the pod is already successfully deployed.

Okay, I managed to start an alpine pod on a node,...

AKS sysbox nodes sporadically enter bad state (`stat failed on /var/lib/containers/storage/overlay/...`)

Aha, okay, I think I found the cause. These pods are running CI jobs, and a few of them were failing (consistently). I had originally planned to dive deeper into...

AKS sysbox nodes sporadically enter bad state (`stat failed on /var/lib/containers/storage/overlay/...`)

> > It turned out that they were hitting the pod's memory limit and getting oom killed. > > I see; what must be happening then is that the container's...

AKS sysbox nodes sporadically enter bad state (`stat failed on /var/lib/containers/storage/overlay/...`)

It's a bit of an aside, but I do notice that if I start a pod that runs `stress-ng --vm-bytes 200M --vm-keep -m 1 --oomable` with a memory limit of...

AKS sysbox nodes sporadically enter bad state (`stat failed on /var/lib/containers/storage/overlay/...`)

Ahh, after scrounging through the kernel logs more, I am now noticing something else that is unique to bad nodes (I don't see this output in good node kernel logs),...

AKS sysbox nodes sporadically enter bad state (`stat failed on /var/lib/containers/storage/overlay/...`)

Okay, one more update. Unfortunately, I don't have any new information to help debug the root cause of the issue; however, I *am* now working around it. By upgrading from...