Jonathon Belotti

Results: 166 comments by Jonathon Belotti

> Believe it or not we have trouble getting access to these types of machines even for our own testing. 😅

jeez, it's rough out there.

**H100**

```
1: lo:...
```

Thanks for the detailed comments @kevinGC! I think they mostly make sense to me, but I'll work through the details more carefully tomorrow while also testing out https://github.com/google/gvisor/commit/94c10243701c6a5d884c0f5f106d65ad34e6729d. To answer...

### **Testing result:** https://github.com/google/gvisor/commit/94c10243701c6a5d884c0f5f106d65ad34e6729d

```
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ ./runsc --version
runsc version release-20240506.0-13-g94c10243701c
spec: 1.1.0-rc.1
[modal@gcp-h100-us-east4-a-0-120e6a37-350e-4d92-8b7c-507f678ee562 ~]$ sudo ./runsc do ./speedtest-cli --secure
Retrieving speedtest.net configuration...
Testing from Google Cloud (34.48.63.7)...
Retrieving...
```

`--gso=false` does indeed improve upload!

```
[modal@gcp-h100-us-east4-a-0-a275c742-c07d-433e-bcc0-46bf967048d7 ~]$ sudo ./production/runsc -gso=false do ./speedtest-cli --secure
Retrieving speedtest.net configuration...
Testing from Google Cloud (34.86.32.183)...
Retrieving speedtest.net server list...
Selecting best server based...
```

Also worth noting that this implementation is for driver version 535. The latest driver has different params for `NV0000_CTRL_CMD_OS_UNIX_GET_EXPORT_OBJECT_INFO`.

https://modal-public-assets.s3.amazonaws.com/runsc.log.20240512-202107.171204.boot.txt.zip contains the debug logs for the program above (~150MiB).

* **uname -a**: `Linux gcp-a100-80gb-spot-europe-west4-a-0-b819afa2-755d-47d0-b84d-667 5.15.0-205.149.5.4.el9uek.x86_64 #2 SMP Wed May 8 15:31:38 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux`
* **instance...

The reproduction program is almost identical to the one in https://github.com/google/gvisor/issues/9827, which is why I revisited that issue's test.

* Oh yep, fixed that in the original description.
* Our `--shm-size` is also set very large. On Oracle workers it's around 1657GB. We have `Driver Version: 535.129.03 CUDA Version:...

> Surprisingly, this workload gets stuck without gVisor.

Interesting. This may be the same problem as in https://github.com/google/gvisor/issues/9827 where the test got stuck on `runc`. The program doesn't get stuck...