GPU 0 is always used in a multi-GPU setup

Open nikolai-franke opened this issue 2 years ago • 6 comments

System:

  • OS version: Red Hat Enterprise Linux (RHEL) 8.x
  • Python version: Python 3.10 and Python 3.9
  • SAPIEN version: sapien==2.2.2
  • Environment: Server with xvfb

Describe the bug

SAPIEN always uses GPU 0 in a multi-GPU setup, in addition to the GPU specified by CUDA_VISIBLE_DEVICES.

To Reproduce

  1. Run modified examples/robotics/basic_robot.py script (the only difference is that there is no Viewer) https://pastebin.com/abuJeuVG with CUDA_VISIBLE_DEVICES=0
  2. Run modified examples/robotics/basic_robot.py script (the only difference is that there is no Viewer) https://pastebin.com/abuJeuVG with CUDA_VISIBLE_DEVICES=1
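One way to compare per-GPU usage for the two runs above is to sample nvidia-smi before and after launching the script. The following is only a sketch: it assumes nvidia-smi is on the PATH, and the helper names and the script filename in the usage note are hypothetical, not part of SAPIEN or the original report.

```python
import os
import subprocess
import sys
import time


def parse_memory_used(csv_text):
    """Parse 'index, memory.used' CSV rows into {gpu_index: used_MiB}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def query_memory_used():
    """Ask nvidia-smi for per-GPU memory use (assumes nvidia-smi is installed)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_used(result.stdout)


def launch(script, visible_devices):
    """Start the script with CUDA_VISIBLE_DEVICES set for the child process only."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(visible_devices))
    return subprocess.Popen([sys.executable, script], env=env)


def memory_delta(script, visible_devices, settle_seconds=10.0):
    """Return the per-GPU change in used memory (MiB) while the script runs."""
    before = query_memory_used()
    proc = launch(script, visible_devices)
    try:
        time.sleep(settle_seconds)  # give the simulation time to allocate
        after = query_memory_used()
    finally:
        proc.terminate()
        proc.wait()
    return {gpu: after[gpu] - before.get(gpu, 0) for gpu in after}
```

For example, `memory_delta("basic_robot_no_viewer.py", 1)` (hypothetical filename for the modified script) would be expected to show a nonzero delta only for GPU 1 if device selection worked as intended. Other processes on the node can also change these numbers, so the result is indicative rather than conclusive.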

Expected behavior

Only the selected GPU should show any usage. That is the case for CUDA_VISIBLE_DEVICES=0. For CUDA_VISIBLE_DEVICES=1, however, both GPU 0 and GPU 1 are used.

Screenshots

CUDA_VISIBLE_DEVICES=0: [image: cuda_0]
CUDA_VISIBLE_DEVICES=1: [image: cuda_1]

Additional context

Even though GPU 0 is only used a little when CUDA_VISIBLE_DEVICES=1, this usage quickly adds up when running many parallel simulations. I am using ManiSkill2 for reinforcement learning on an HPC node with 4 Nvidia A100 GPUs, and this bug severely limits the number of parallel environments I can run. Additionally, running many parallel environments becomes slow, since GPU 0 is used by every single simulation environment instead of by just a quarter of them.

nikolai-franke avatar Oct 19 '23 06:10 nikolai-franke

You may try passing offscreen_only=True to the SapienRenderer constructor. This behavior will be changed in the future (so that the CUDA device takes priority over on-screen rendering).

fbxiang avatar Oct 23 '23 22:10 fbxiang

Passing offscreen_only=True doesn't make a difference.

nikolai-franke avatar Oct 25 '23 05:10 nikolai-franke

I cannot figure out what is causing the issue. I think you should set the PCI id of the device you want to use directly. This method requires a bit of setup but should never fail. First, before creating anything with SAPIEN, run sapien.SapienRenderer.set_log_level("info"). Next, run your code. You will see a table listing the devices visible to Vulkan, and each of your GPUs will have a PciBus field. The PciBus value is unique to each physical GPU. Then, when you create the SapienRenderer, pass device="pci:x", where x is the PciBus id shown in the log. This should bypass all other checks.
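The steps above could look roughly like the sketch below. The helper names are hypothetical, the `import sapien.core as sapien` line is an assumption based on the usual SAPIEN 2.x import convention, and the PciBus value must be copied from your own log output; only set_log_level, offscreen_only and device="pci:x" come from this thread.

```python
def renderer_kwargs(pci_bus, offscreen_only=True):
    """Build constructor arguments that pin the renderer to one physical GPU.

    pci_bus is the PciBus id exactly as printed in the SAPIEN device table.
    """
    return {"offscreen_only": offscreen_only, "device": f"pci:{pci_bus}"}


def create_renderer(pci_bus):
    """Enable info logging, then create a renderer on the chosen GPU."""
    # Assumption: SAPIEN 2.x is installed and exposes its API via sapien.core.
    import sapien.core as sapien

    # Must run before anything else is created so the device table is logged.
    sapien.SapienRenderer.set_log_level("info")
    return sapien.SapienRenderer(**renderer_kwargs(pci_bus))
```

Run the script once without the device argument to read the PciBus column from the log, then call `create_renderer` with that value on later runs.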

fbxiang avatar Nov 11 '23 21:11 fbxiang

Thank you very much for your answer! Sadly the result is still exactly the same. GPU 0 always gets used, even when selecting another GPU via PCI address.

nikolai-franke avatar Nov 12 '23 10:11 nikolai-franke

Are you using sapien==2.2.2? I have verified that the GPU selection feature is working. You can try calling sapien.SapienRenderer.set_log_level("info") before creating the renderer. It will list all available GPUs on the console and tell you which GPU is selected for rendering. Since an incorrect PCI id would result in an error, my guess is that some other program, not the SAPIEN renderer, is running on your GPU 0.

fbxiang avatar Nov 22 '23 05:11 fbxiang

I'm actually having the same issue.

balazsgyenes avatar Dec 21 '23 11:12 balazsgyenes