dm_control Unify `EGL_DEVICE_ID` in `dm_control` with `MUJOCO_EGL_DEVICE_ID` in `mujoco`

I'm using the EGL backend for GPU rendering of mujoco environments. I set the MUJOCO_EGL_DEVICE_ID environment variable to select which GPU is used for rendering, but it has no effect.

MUJOCO_GL=egl MUJOCO_EGL_DEVICE_ID=0 CUDA_VISIBLE_DEVICES=0 python run.py
MUJOCO_GL=egl MUJOCO_EGL_DEVICE_ID=1 CUDA_VISIBLE_DEVICES=1 python run.py
MUJOCO_GL=egl MUJOCO_EGL_DEVICE_ID=2 CUDA_VISIBLE_DEVICES=2 python run.py
...

and all the rendering happens on the first GPU (index 0):

[0] NVIDIA TITAN Xp | 29°C,   3 % |  2058 / 12196 MB | python(607M) python(86M) python(86M) python(86M) python(86M) python(86M) python(86M) python(86M)
[1] NVIDIA TITAN Xp | 29°C,  14 % |   987 / 12196 MB | python(521M)
[2] NVIDIA TITAN Xp | 29°C,   7 % |   987 / 12196 MB | python(521M)
[3] NVIDIA TITAN Xp | 28°C,   8 % |   987 / 12196 MB | python(521M)
[4] NVIDIA TITAN Xp | 29°C,   7 % |   987 / 12196 MB | python(521M)
[5] NVIDIA TITAN Xp | 26°C,   7 % |   987 / 12196 MB | python(521M)
[6] NVIDIA TITAN Xp | 30°C,   6 % |   987 / 12196 MB | python(521M)
[7] NVIDIA TITAN Xp | 28°C,   7 % |   987 / 12196 MB | python(521M)

I checked the code at https://github.com/deepmind/mujoco/blob/main/python/mujoco/egl/init.py#L35, and judging from the debugger, the correct device is selected out of the 8 available GPUs. Initialization succeeds, but the rendering itself does not run on the designated GPU.

OS: Ubuntu Linux 20.04 LTS (focal)
MuJoCo version: 2.1.0 (tried 2.2.x as well)
EGL installation:

ii  libegl-dev:amd64                       1.3.2-1~ubuntu0.20.04.2               amd64        Vendor neutral GL dispatch library -- EGL development files
ii  libegl-mesa0:amd64                     21.2.6-0ubuntu0.1~20.04.2             amd64        free implementation of the EGL API -- Mesa vendor library
ii  libegl1:amd64                          1.3.2-1~ubuntu0.20.04.2               amd64        Vendor neutral GL dispatch library -- EGL support
ii  libegl1-mesa-dev:amd64                 21.2.6-0ubuntu0.1~20.04.2             amd64        free implementation of the EGL API -- development files
ii  libwayland-egl1:amd64                  1.18.0-1                              amd64        wayland compositor infrastructure - EGL library

ii  libopengl-dev:amd64                    1.3.2-1~ubuntu0.20.04.2               amd64        Vendor neutral GL dispatch library -- OpenGL development files
ii  libopengl0:amd64                       1.3.2-1~ubuntu0.20.04.2               amd64        Vendor neutral GL dispatch library -- OpenGL support

python: PyOpenGL==3.1.6 (latest)
I'm using dm_control with mujoco==2.2.0 python bindings. (not mujoco_py)

Aug 29 '22 16:08 wookayin

I see 521M allocations attributed to python on devices 1-7, do you know what those are for?

Aug 30 '22 10:08 saran-t

Those are CUDA (jax/tf) programs, which are assigned to GPUs via CUDA_VISIBLE_DEVICES. The rendering is all done on GPU 0 (the processes using 86M of memory each).

My apologies for the confusion.

Aug 30 '22 21:08 wookayin

Can you please help me debug this by inserting a print statement in https://github.com/deepmind/mujoco/blob/main/python/mujoco/egl/init.py#L48 to check the contents of candidates and all_devices? Thank you.

Sep 07 '22 14:09 saran-t

@saran-t I've done this before when I submitted the issue, but here's the result for you:

For example, with MUJOCO_EGL_DEVICE_ID=1 and MUJOCO_GL=egl:

all_devices=[
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd05095c0>, 
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509640>, 
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd05096c0>, 
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509740>, 
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd05097c0>, 
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509840>, 
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd05098c0>, 
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509940>, 
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd05099c0>, 
<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509ac0>
]
selected_device='1'
candidates=[<OpenGL._opaque.EGLDeviceEXT_pointer object at 0x7f5fd0509640>]

# the one being returned in Line 61, same as `mujoco.egl.EGL_DISPLAY`
SUCCESS display=<OpenGL._opaque.EGLDisplay_pointer object at 0x7f5fd0509c40>

The logic for choosing the EGL device seems correct, so the problem might be with the EGL backend in my environment.
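
For reference, the selection step that produced the output above can be sketched roughly as follows. This is a simplified stand-in and not the actual mujoco.egl source: select_candidates is a hypothetical name, and plain strings stand in for the EGLDeviceEXT handles.

```python
import os


def select_candidates(all_devices, env=None):
    """Return the EGL devices to try, honoring MUJOCO_EGL_DEVICE_ID if set.

    Hypothetical helper for illustration; not part of mujoco or dm_control.
    """
    env = os.environ if env is None else env
    selected_device = env.get("MUJOCO_EGL_DEVICE_ID")
    if selected_device is None:
        # No preference given: every enumerated device is a candidate.
        return list(all_devices)
    device_idx = int(selected_device)
    if not 0 <= device_idx < len(all_devices):
        raise RuntimeError(
            f"MUJOCO_EGL_DEVICE_ID must be in [0, {len(all_devices) - 1}], "
            f"got {device_idx}."
        )
    # A one-element list, so the caller can loop over candidates uniformly.
    return [all_devices[device_idx]]


# Placeholder handles; the real list holds EGLDeviceEXT pointers.
devices = [f"device{i}" for i in range(10)]
print(select_candidates(devices, {"MUJOCO_EGL_DEVICE_ID": "1"}))  # prints ['device1']
```

With selected_device='1' this yields a single candidate, the second entry of all_devices, which matches the addresses printed above.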

Sep 07 '22 14:09 wookayin

@saran-t I figured it out.

The code I was running is actually dm_control.Environment.physics.render() (sorry, I didn't include this in the original report):

import dm_control.suite
env = dm_control.suite.load("humanoid", "walk")
env.physics.render()

Debugging step-by-step, this EGL.eglMakeCurrent line is where the GPU context is actually being created:

https://github.com/deepmind/dm_control/blob/main/dm_control/_render/pyopengl/egl_renderer.py#L125

and dm_control._render.pyopengl.egl_renderer has exactly the same logic for choosing EGL devices as mujoco.egl does! I assumed that dm_control uses mujoco internally for handling EGL and rendering, but the two are separate codebases with duplicated logic. The key difference is that dm_control's renderer reads the EGL_DEVICE_ID environment variable, not MUJOCO_EGL_DEVICE_ID. This is quite confusing.

Setting the environment variable EGL_DEVICE_ID=... makes dm_control use the correct GPU for rendering. So the bug I experienced turns out to be a dm_control issue. Is there any room for improvement here?
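
Until the two variables are unified, one workaround is to set both to the same index before anything imports the renderer. A minimal sketch, assuming each library reads its variable when the EGL display is first initialized; use_render_gpu is a hypothetical helper, not part of either library:

```python
import os


def use_render_gpu(device_id, env=None):
    """Point both mujoco's and dm_control's EGL device selection at one GPU.

    Hypothetical helper for illustration; not part of mujoco or dm_control.
    """
    env = os.environ if env is None else env
    env["MUJOCO_GL"] = "egl"
    env["MUJOCO_EGL_DEVICE_ID"] = str(device_id)  # read by mujoco.egl
    env["EGL_DEVICE_ID"] = str(device_id)  # read by dm_control's EGL renderer
    return env


# Call this before importing dm_control, so that the variables are already
# set when the renderer initializes its EGL display.
use_render_gpu(1)

# from dm_control import suite
# env = suite.load("humanoid", "walk")
# env.physics.render()
```

Passing a plain dict as env leaves the current process untouched, which is convenient when building the environment for a subprocess per GPU.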

Sep 07 '22 15:09 wookayin

Yes, that's on our cleanup backlog. Render context management is the one thing that we haven't unified across mujoco and dm_control.

I'll transfer this issue over to the dm_control repo.

Sep 07 '22 15:09 saran-t

I completely forgot about this! Going to fix it for the upcoming dm_control release.

Dec 05 '22 15:12 saran-t
