nvproxy: support Nvidia H100 and confidential mode
The changes in this PR enable Nvidia H100 support in gVisor, including the H100's confidential compute mode. We have tested these patches by running examples from cuda-samples (vectorAdd, vectorAddDrv, matrixMul) and by running a Llama-based tensorrt-llm engine inside gVisor. We also successfully ran Nvidia's Triton inference server. For the tensorrt-llm engines, you also need to disable calls (example) to open-mpi. For Triton, you also need to stub hostnet's SIOCGIFADDR ioctl implementation. We can add the SIOCGIFADDR stub (it just returns 0), but this didn't seem appropriate for upstream.
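For readers unfamiliar with what "a stub that just returns 0" means in practice, here is a minimal, self-contained sketch. It is not gVisor's hostnet code: `ioctlHandler`, `handlers`, `dispatchIoctl`, and `errNotSupported` are hypothetical names made up for illustration. Only the request number `0x8915` for `SIOCGIFADDR` comes from the Linux headers.

```go
package main

import (
	"errors"
	"fmt"
)

// SIOCGIFADDR is the Linux ioctl request number for reading an
// interface address (value from <linux/sockios.h>).
const SIOCGIFADDR = 0x8915

// ioctlHandler is a hypothetical handler signature, not gVisor's.
type ioctlHandler func(arg uintptr) (uintptr, error)

// errNotSupported stands in for an "unsupported ioctl" errno in this sketch.
var errNotSupported = errors.New("ioctl not supported")

// handlers is a hypothetical dispatch table keyed by request number.
var handlers = map[uint32]ioctlHandler{
	// Stub: report success without filling in the caller's ifreq.
	SIOCGIFADDR: func(arg uintptr) (uintptr, error) { return 0, nil },
}

// dispatchIoctl looks up the request and runs its handler, or fails
// if no handler is registered.
func dispatchIoctl(req uint32, arg uintptr) (uintptr, error) {
	h, ok := handlers[req]
	if !ok {
		return 0, errNotSupported
	}
	return h(arg)
}

func main() {
	ret, err := dispatchIoctl(SIOCGIFADDR, 0)
	fmt.Println(ret, err)
}
```

A stub like this lets a caller that only checks the return value proceed, but the caller never receives a real address, which is one reason such a change may not be suitable upstream.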
I want to highlight the change in bfe2eb7ab838c73767a0906a1b509bc67f75e276. I am not sure what implications it has. Could someone familiar with that flag explain? We have mainly observed that it disables merging of memory mappings inside the sentry, which is the behavior we were looking for.
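To make "merging of memory mappings" concrete, here is a small conceptual sketch. It does not reproduce the sentry's memory manager or the flag touched by the commit: the `mapping` type, the `noMerge` field, and `insert` are all hypothetical. It only shows the general idea that contiguous mappings are usually coalesced into one entry, and that an opt-out flag keeps them separate.

```go
package main

import "fmt"

// mapping is a hypothetical stand-in for a memory mapping covering
// the half-open address range [start, end).
type mapping struct {
	start, end uint64
	noMerge    bool // hypothetical opt-out; not the flag from the commit
}

// insert appends m to a list kept in address order. It coalesces m with
// the previous entry when the two are contiguous and neither opts out.
func insert(ms []mapping, m mapping) []mapping {
	if n := len(ms); n > 0 {
		last := &ms[n-1]
		if last.end == m.start && !last.noMerge && !m.noMerge {
			last.end = m.end
			return ms
		}
	}
	return append(ms, m)
}

func main() {
	var merged, separate []mapping
	for _, m := range []mapping{{0x1000, 0x2000, false}, {0x2000, 0x3000, false}} {
		merged = insert(merged, m)
	}
	for _, m := range []mapping{{0x1000, 0x2000, true}, {0x2000, 0x3000, true}} {
		separate = insert(separate, m)
	}
	fmt.Println(len(merged), len(separate))
}
```

In this sketch the first pair collapses into a single range while the second pair stays as two entries, which matches the kind of behavior described above, where each mapping keeps its own identity.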
I am not sure if this requires more unit tests to be merged. Happy to add something if you could tell me where/what.
Fixes https://github.com/google/gvisor/issues/9846
Requires https://github.com/google/gvisor/pull/10008
> We have tested these patches by running examples from cuda-samples(vectorAdd, vectorAddDrv, matrixMul)

> I am not sure if this requires more unittests to be merged. Happy to add something if you could tell me where/what.
Awesome, it would be nice to add those CUDA samples to our GPU test suite as well! Feel free to do this in a separate PR. Our GPU tests are here: https://github.com/google/gvisor/tree/master/test/gpu. Those tests reference images defined here: https://github.com/google/gvisor/tree/master/images/gpu.
@ayushr2 can this merge?
Ah whoops! I forgot the "internal safe review approval" thing. @derpsteb could you rebase and fix the merge conflicts? Then I will get this merged.
@ayushr2 we're keen on this so could one of us do the rebase and merge conflict fix? 🙂
I have fixed it up in #10234. Will submit that.
Thanks! 🙏
Hey. Thanks for rebasing this. I was on holidays. Seems like there is no todo for me here atm.
@ayushr2 is https://github.com/google/gvisor/pull/10234 queued, and will it merge on its own?
Urgh that is blocked on an internal test breakage... If it is urgent, I can bypass those breakages and submit.
Thanks @ayushr2! Wasn't urgent, but having gVisor support for H100s will let us better fight off cryptominer pests!