Ayush Ranjan

172 comments by Ayush Ranjan

I tried running the above-mentioned Docker image on an A100 with runsc. It segfaults and crashes with a different error: ``` Traceback (most recent call last): File "repro.py", line...

What Nvidia driver version is being used at Modal? I was testing on 525.105.17.

The `0x0000cb33` allocation class is `NV_CONFIDENTIAL_COMPUTE`, which was only added in 535.43.02. So I tested again with the 535.54.03 driver, and the above-mentioned unknown ioctls disappeared. However, the segfault persists. This...

RIP = 0x7eb1f7de9ddb
```
VMAs:
...
7eb1f7d68000-7eb1f7d8a000 r--p 00000000 00:19 34 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7eb1f7d8a000-7eb1f7f02000 r-xp 00022000 00:19 34 /usr/lib/x86_64-linux-gnu/libc-2.31.so
```
So we need to look at offset `0x00022000 + (0x7eb1f7de9ddb -...
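The address-to-offset step can be sketched in Python. This is only an illustration, not something from the original investigation: the function name `file_offset_for_address` is hypothetical, and it assumes the VMA listing uses the `/proc/<pid>/maps` column layout shown above (address range, permissions, file offset, device, inode, path). It finds the mapping that contains the faulting address and adds the distance from the mapping's start to the mapping's file offset.

```python
# Hypothetical helper: map a faulting address to an offset in the backing file.
# Assumes /proc/<pid>/maps-style lines: "start-end perms offset dev inode path".

MAPS = """\
VMAs:
...
7eb1f7d68000-7eb1f7d8a000 r--p 00000000 00:19 34 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7eb1f7d8a000-7eb1f7f02000 r-xp 00022000 00:19 34 /usr/lib/x86_64-linux-gnu/libc-2.31.so
"""


def file_offset_for_address(maps_text, addr):
    """Return (path, file_offset) for the mapping containing addr, or None."""
    for line in maps_text.splitlines():
        fields = line.split()
        # Skip headers, ellipses, and anything that is not a mapping line.
        if len(fields) < 5 or "-" not in fields[0]:
            continue
        try:
            start_s, end_s = fields[0].split("-")
            start, end = int(start_s, 16), int(end_s, 16)
            map_offset = int(fields[2], 16)
        except ValueError:
            continue
        if start <= addr < end:
            path = fields[5] if len(fields) > 5 else ""
            # File offset = mapping's file offset + distance into the mapping.
            return path, map_offset + (addr - start)
    return None


if __name__ == "__main__":
    path, offset = file_offset_for_address(MAPS, 0x7EB1F7DE9DDB)
    print(path, hex(offset))
```

The result is a file offset. It matches the addresses printed by `objdump -d` only when the segment's virtual address equals its file offset, which is common for a shared library's text segment but worth confirming with `readelf -l`. It also has to be looked up in the same file that is actually mapped inside the sandbox.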

Using `objdump -d`: ``` 81dbd: 48 89 44 24 28 mov %rax,0x28(%rsp) 81dc2: 8b 44 24 24 mov 0x24(%rsp),%eax 81dc6: 48 8d 35 1d 42 11 00 lea 0x11421d(%rip),%rsi #...

@nixprime pointed out that I was looking at the objdump of the wrong file. He figured out the actual fault instruction: ``` # objdump -d /usr/lib/x86_64-linux-gnu/libc-2.31.so | less ... 0000000000081dd0...

The logs also show an unimplemented control command (`NV2080_CTRL_CMD_NVLINK_GET_NVLINK_CAPS`). I added support for it in https://github.com/google/gvisor/pull/9835, but it did not fix the issue. The logs also show 4 user faults...

Do you know if the A100 GPU has 40GB memory or 80GB?

So the `BusError: no space left on device` error on page fault in gVisor happens because we are hitting the tmpfs size limit for `/dev/shm`. The OCI spec shows that `/dev/shm`...
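One way to check this kind of limit from inside the container is to compare the mount's reported free space with the size of the shared-memory allocation. The sketch below is a rough illustration only: the helper names are hypothetical, and it assumes the filesystem statistics visible to `shutil.disk_usage` reflect the tmpfs size limit in the sandbox.

```python
import shutil


def tmpfs_headroom(path="/dev/shm"):
    """Return total, used, and free bytes for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return {"total": usage.total, "used": usage.used, "free": usage.free}


def would_fit(nbytes, path="/dev/shm"):
    """Rough check: does an allocation of nbytes fit in the free space?"""
    return nbytes <= shutil.disk_usage(path).free


if __name__ == "__main__":
    # "." is used here so the example runs anywhere; pass "/dev/shm" in the container.
    print(tmpfs_headroom("."))
    print(would_fit(64 * 1024 * 1024, "."))
```

Because files on tmpfs that are mapped with `mmap` only consume space when pages are touched, a write past the limit surfaces as a bus error at page-fault time rather than as a failed `write` call, so a check like this is only a heuristic.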

Strace logs show an interesting pattern: ``` I1223 19:14:31.881425 937434 strace.go:564] [ 15: 15] python3 E lstat(0x7ef4b30c5190 /dev/shm/ZVh4Mj, 0x7ef4b30c50a0) I1223 19:14:31.881442 937434 strace.go:602] [ 15: 15] python3 X lstat(0x7ef4b30c5190 /dev/shm/ZVh4Mj,...