Ayush Ranjan
This is a cuda-checkpoint issue: https://github.com/NVIDIA/cuda-checkpoint/issues/4.
FYI I ran the reproducer and got a different error. This is the same error from https://modal-public-assets.s3.amazonaws.com/gpu_ckpt_logs.zip: `W0522 21:57:27.787759 25674 util.go:64] FATAL ERROR: checkpoint failed: checkpointing container "d83b8fa4-08bd-4d69-9a6f-5c3c28e98856": encoding error:...
I can see from the logs that you have a multi-container setup. The first container is a pause container. The second container is the `nvidia-smi` container. Both have the `nvidia-container-runtime-hook` prestart...
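For anyone following along, one way to confirm this kind of setup is to look at each container's OCI spec for the injected prestart hook. This is just a sketch; the bundle path and the `k8s.io` namespace below are assumptions and may differ depending on your containerd layout:

```sh
# Assumed containerd shim v2 task directory and "k8s.io" namespace; adjust as needed.
for bundle in /run/containerd/io.containerd.runtime.v2.task/k8s.io/*/; do
  echo "== ${bundle}"
  # Print any prestart hooks declared in the OCI spec; nvidia-container-runtime-hook
  # shows up here when the NVIDIA runtime wrapped the container.
  jq '.hooks.prestart // []' "${bundle}config.json"
done
```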
Can you show all the log files? Were there no `runsc.log.*.wait.txt` or `runsc.log.*.delete.txt` log files? Usually containerd checks whether the container has stopped via `runsc wait`, which will update the container...
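In case it helps to reproduce that check manually, here is a rough sketch. The paths are assumptions: the debug-log directory depends on your `--debug-log` setting and the `--root` value depends on how containerd invokes runsc.

```sh
# List the per-command debug logs; /tmp/runsc/ is an assumed --debug-log prefix.
ls -l /tmp/runsc/ | grep -E '\.(wait|delete)\.txt'

# Ask runsc directly for the container state that containerd would observe.
# The --root value is an assumption; containerd typically uses a per-namespace root.
runsc --root /run/containerd/runsc/k8s.io state <container-id>
```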
> Because of this, the gofer process for the second container is still running.

I am not sure if this is the reason the container is reported as RUNNING. If...
- The `runsc.log.20241008-131727.790706.create.txt` and `runsc.log.20241008-131727.984895.delete.txt` files are from a different sandbox execution attempt, so I am disregarding those.
- #11003 did make the gofer process exit. I can see the logs...
@PSKP-95 Does this issue not occur with runc (without gVisor)?
@zkoopmans From the debug logs provided above, it looks like the nvproxy flag is set.
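For reference, this is roughly how I'd check that; the log path is an assumption based on a typical `--debug-log` setting. runsc records its arguments/configuration at the start of its debug logs, so nvproxy should show up there if it was enabled for the sandbox:

```sh
# Look for the nvproxy flag in the runsc debug logs (assumed /tmp/runsc/ prefix).
grep -i 'nvproxy' /tmp/runsc/runsc.log.*.txt
```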
> I think that's with the NVIDIA shim method he was using, no?

Not sure if there is an NVIDIA shim involved here. He's using the `nvidia-container-runtime`, which in turn uses...
Okay, I was able to fix my setup (the issue was that there was a lingering pod in the background somehow). I can confirm that fixing `/etc/containerd/config.toml` to specify `runtime_type =...
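For anyone else hitting this, a minimal sketch of the relevant stanza, based on the standard gVisor/containerd setup documentation; the exact plugin path can vary with your containerd config version, so treat this as an illustration rather than a drop-in fix:

```sh
# Append a runsc runtime entry to containerd's config and restart containerd
# so it picks up the change (sketch; adjust paths and names to your environment).
cat <<'EOF' | sudo tee -a /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
EOF
sudo systemctl restart containerd
```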