dm_control
Unify `EGL_DEVICE_ID` in `dm_control` with `MUJOCO_EGL_DEVICE_ID` in `mujoco`
I'm using the EGL backend for GPU rendering of MuJoCo environments. I set the MUJOCO_EGL_DEVICE_ID environment variable to select which GPU is used for rendering, but it has no effect.
MUJOCO_GL=egl MUJOCO_EGL_DEVICE_ID=0 CUDA_VISIBLE_DEVICES=0 python run.py
MUJOCO_GL=egl MUJOCO_EGL_DEVICE_ID=1 CUDA_VISIBLE_DEVICES=1 python run.py
MUJOCO_GL=egl MUJOCO_EGL_DEVICE_ID=2 CUDA_VISIBLE_DEVICES=2 python run.py
...
and all the rendering happens on the first GPU (index 0):
[0] NVIDIA TITAN Xp | 29°C, 3 % | 2058 / 12196 MB | python(607M) python(86M) python(86M) python(86M) python(86M) python(86M) python(86M) python(86M)
[1] NVIDIA TITAN Xp | 29°C, 14 % | 987 / 12196 MB | python(521M)
[2] NVIDIA TITAN Xp | 29°C, 7 % | 987 / 12196 MB | python(521M)
[3] NVIDIA TITAN Xp | 28°C, 8 % | 987 / 12196 MB | python(521M)
[4] NVIDIA TITAN Xp | 29°C, 7 % | 987 / 12196 MB | python(521M)
[5] NVIDIA TITAN Xp | 26°C, 7 % | 987 / 12196 MB | python(521M)
[6] NVIDIA TITAN Xp | 30°C, 6 % | 987 / 12196 MB | python(521M)
[7] NVIDIA TITAN Xp | 28°C, 7 % | 987 / 12196 MB | python(521M)
I checked the code at https://github.com/deepmind/mujoco/blob/main/python/mujoco/egl/init.py#L35, and judging from the debugger, the correct display seems to be selected out of the 8 available GPUs. Initialization succeeds, but the designated GPU is not the one actually used for rendering.
- OS: Ubuntu 20.04 LTS (focal)
- MuJoCo version: 2.1.0 (tried 2.2.x as well)
- EGL installation:
ii libegl-dev:amd64 1.3.2-1~ubuntu0.20.04.2 amd64 Vendor neutral GL dispatch library -- EGL development files
ii libegl-mesa0:amd64 21.2.6-0ubuntu0.1~20.04.2 amd64 free implementation of the EGL API -- Mesa vendor library
ii libegl1:amd64 1.3.2-1~ubuntu0.20.04.2 amd64 Vendor neutral GL dispatch library -- EGL support
ii libegl1-mesa-dev:amd64 21.2.6-0ubuntu0.1~20.04.2 amd64 free implementation of the EGL API -- development files
ii libwayland-egl1:amd64 1.18.0-1 amd64 wayland compositor infrastructure - EGL library
ii libopengl-dev:amd64 1.3.2-1~ubuntu0.20.04.2 amd64 Vendor neutral GL dispatch library -- OpenGL development files
ii libopengl0:amd64 1.3.2-1~ubuntu0.20.04.2 amd64 Vendor neutral GL dispatch library -- OpenGL support
- python:
PyOpenGL==3.1.6 (latest). I'm using dm_control with the mujoco==2.2.0 Python bindings (not mujoco_py).
I see 521M allocations attributed to python on devices 1-7. Do you know what those are for?
Those are CUDA (JAX/TF) programs, and their device placement is controlled by CUDA_VISIBLE_DEVICES. All rendering is done on GPU 0 (the processes using 86M of memory each).
My apologies for the confusion.
Can you please help me debug this by inserting a print statement in https://github.com/deepmind/mujoco/blob/main/python/mujoco/egl/init.py#L48 to check the contents of candidates and all_devices? Thank you.
@saran-t I had already done this when I submitted the issue, but here's the result:
For example, with MUJOCO_EGL_DEVICE_ID=1 and MUJOCO_GL=egl:
all_devices=[
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd05095c0>,
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509640>,
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd05096c0>,
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509740>,
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd05097c0>,
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509840>,
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd05098c0>,
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509940>,
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd05099c0>,
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509ac0>
]
selected_device='1'
candidates=[<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509640>]
# the one being returned in Line 61, same as `mujoco.egl.EGL_DISPLAY`
SUCCESS display=<OpenGL._opaque.EGLDisplay_pointer object at 0x7f5fd0509c40>
So the logic for choosing the EGL device seems correct, which suggests the problem may lie in the EGL backend in my environment.
@saran-t I figured it out.
The code I was actually running calls dm_control.Environment.physics.render() (sorry for not including this in the original report):
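For readers following along, the selection step whose output is printed above can be sketched in pure Python. This is a simplified illustration, not the actual mujoco.egl code: `select_candidates` is a hypothetical name, and the device handles are replaced by plain strings so that the snippet runs without EGL installed.

```python
def select_candidates(all_devices, selected_device=None):
    """Return the EGL devices to try, given the raw environment variable value.

    `selected_device` stands in for the string read from the environment
    (e.g. '1'), or None when the variable is unset. Hypothetical helper.
    """
    if selected_device is None:
        # No device requested: every enumerated device is a candidate.
        return list(all_devices)
    device_idx = int(selected_device)
    if not 0 <= device_idx < len(all_devices):
        raise RuntimeError(
            f"Device index must be in [0, {len(all_devices)}), got {device_idx}."
        )
    # A single-element list holding only the requested device.
    return list(all_devices[device_idx:device_idx + 1])


# Stand-ins for the ten opaque EGLDeviceEXT pointers printed above.
devices = [f"device{i}" for i in range(10)]
candidates = select_candidates(devices, "1")
```

With the index set to '1', `candidates` holds only the second enumerated device, which matches the single-element `candidates` list in the debug output.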
import dm_control.suite
env = dm_control.suite.load("humanoid", "walk")
env.physics.render()
Debugging step-by-step, this EGL.eglMakeCurrent line is where the GPU context is actually being created:
https://github.com/deepmind/dm_control/blob/main/dm_control/_render/pyopengl/egl_renderer.py#L125
and dm_control._render.pyopengl.egl_renderer has exactly the same logic for choosing EGL devices as mujoco.egl does! I assumed that dm_control uses mujoco internally for handling EGL and rendering, but they are separate (duplicated) codebases. The key difference is that dm_control's renderer reads the EGL_DEVICE_ID environment variable, not MUJOCO_EGL_DEVICE_ID. This is quite confusing.
Setting the environment variable EGL_DEVICE_ID=... makes dm_control use the correct GPU for rendering. So the bug I experienced turns out to be an issue in dm_control. Is there any room for improvement here?
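Until the two variables are unified, one workaround is to set both names whenever a process is launched. The sketch below is a minimal illustration: `egl_env_for_gpu` is a hypothetical helper, and it assumes the EGL device index and the CUDA device index refer to the same physical GPU, which is not guaranteed (the debug output above lists ten EGL devices on a machine with eight GPUs).

```python
import os


def egl_env_for_gpu(gpu_index, base_env=None):
    """Build an environment mapping that pins EGL rendering to one device.

    Hypothetical helper. Sets both variable names so that dm_control's EGL
    renderer (EGL_DEVICE_ID) and mujoco.egl (MUJOCO_EGL_DEVICE_ID) agree.
    """
    env = dict(os.environ if base_env is None else base_env)
    index = str(int(gpu_index))
    env["MUJOCO_GL"] = "egl"
    env["EGL_DEVICE_ID"] = index          # read by dm_control's EGL renderer
    env["MUJOCO_EGL_DEVICE_ID"] = index   # read by mujoco.egl
    # Assumption: CUDA and EGL enumerate GPUs in the same order.
    env["CUDA_VISIBLE_DEVICES"] = index
    return env


example_env = egl_env_for_gpu(1, base_env={"PATH": "/usr/bin"})
```

The resulting mapping can be passed as `env=` to `subprocess.Popen(["python", "run.py"], env=egl_env_for_gpu(1))`. The variables have to be in place before dm_control is imported in the child process, which launching a fresh process guarantees.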
Yes, that's on our cleanup backlog. Render context management is the one thing that we haven't unified across mujoco and dm_control.
I'll transfer this issue over to the dm_control repo.
I completely forgot about this! Going to fix it for the upcoming dm_control release.