Ayush Ranjan

171 comments by Ayush Ranjan

This is a cuda-checkpoint issue: https://github.com/NVIDIA/cuda-checkpoint/issues/4.

FYI I ran the reproducer and got a different error. This is the same error from https://modal-public-assets.s3.amazonaws.com/gpu_ckpt_logs.zip: `W0522 21:57:27.787759 25674 util.go:64] FATAL ERROR: checkpoint failed: checkpointing container "d83b8fa4-08bd-4d69-9a6f-5c3c28e98856": encoding error:...

I can see from the logs that you have a multi-container setup. The first container is a pause container. The second container is an `nvidia-smi` container. Both have the `nvidia-container-runtime-hook` prestart...

Can you show all log files? Were there no `runsc.log.*.wait.txt` or `runsc.log.*.delete.txt` log files? Usually containerd checks if the container has stopped via `runsc wait`, which will update the container...
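A quick way to answer the question above is to group the debug log files by the `runsc` subcommand that produced them. The following is only a minimal sketch: it assumes the files follow the `runsc.log.<timestamp>.<command>.txt` naming seen in this thread, and the helper names `group_logs_by_command` and `missing_commands` are hypothetical, not part of gVisor.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def group_logs_by_command(filenames):
    """Group runsc debug log file names by subcommand.

    Assumes names look like runsc.log.<timestamp>.<command>.txt,
    so the command is the second-to-last dot-separated component.
    """
    groups = defaultdict(list)
    for name in filenames:
        # Skip anything that does not look like a runsc debug log.
        if not fnmatchcase(name, "runsc.log.*.*.txt"):
            continue
        command = name.split(".")[-2]
        groups[command].append(name)
    return dict(groups)


def missing_commands(filenames, expected):
    """Return the expected subcommands that have no log file at all."""
    groups = group_logs_by_command(filenames)
    return [cmd for cmd in expected if cmd not in groups]
```

To use it, pass the output of `os.listdir()` on the debug log directory (wherever that is configured on your system) as `filenames`, for example `missing_commands(os.listdir(log_dir), ["wait", "delete"])`.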

> Because of this, the gofer process for the second container is still running. I am not sure if this is the reason the container is reported as RUNNING. If...

- The `runsc.log.20241008-131727.790706.create.txt` and `runsc.log.20241008-131727.984895.delete.txt` files are from a different sandbox execution attempt, so I am disregarding those.
- #11003 did make the gofer process exit. Can see the logs...

@PSKP-95 Does this issue not occur with runc (without gVisor)?

@zkoopmans From the debug logs provided above, it looks like the nvproxy flag is set.

> I think that's with the NVIDIA shim method he was using, no?

Not sure if there is an NVIDIA shim involved here. He's using the `nvidia-container-runtime`, which in turn uses...

Okay, I was able to fix my setup (the issue was a lingering pod in the background somehow). I can confirm that fixing `/etc/containerd/config.toml` to specify `runtime_type =...