dm_control
Support selecting an `EGL_DEVICE` by UUID rather than by index
I am encountering an error when deploying `dm_control` in a managed HPC environment. Our admin decided to use UUIDs for the device names, which causes `dm_control` (and `mujoco-py`) to raise an error when parsing the available devices:
Traceback (most recent call last):
File "/Users/ge/mit/dmc_gen/dmc_gen_analysis/__init__.py", line 164, in thunk
File "/home/gridsan/geyang/jaynes-mount/dmc_gen/2021-03-05/085031.707344/dmc_gen/dmc_gen/train.py", line 58, in train
image_size=image_size,
File "/home/gridsan/geyang/jaynes-mount/dmc_gen/2021-03-05/085031.707344/dmc_gen/dmc_gen/wrappers.py", line 28, in make_env
frame_skip=action_repeat
File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dmc2gym/dmc2gym/__init__.py", line 55, in make
return gym.make(env_id)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 145, in make
return registry.make(id, **kwargs)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 90, in make
env = spec.make(**kwargs)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 59, in make
cls = load(self.entry_point)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/site-packages/gym/envs/registration.py", line 18, in load
mod = importlib.import_module(mod_name)
File "/home/gridsan/geyang/.conda/envs/dmcgen/lib/python3.6/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 678, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dmc2gym/dmc2gym/wrappers.py", line 2, in <module>
from dm_control import suite
File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/suite/__init__.py", line 28, in <module>
from dm_control.suite import acrobot
File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/suite/acrobot.py", line 24, in <module>
from dm_control import mujoco
File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/mujoco/__init__.py", line 18, in <module>
from dm_control.mujoco.engine import action_spec
File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/mujoco/engine.py", line 44, in <module>
from dm_control import _render
File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/_render/__init__.py", line 67, in <module>
Renderer = import_func() # pylint: disable=invalid-name
File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/_render/__init__.py", line 36, in _import_egl
from dm_control._render.pyopengl.egl_renderer import EGLContext
File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/_render/pyopengl/egl_renderer.py", line 69, in <module>
EGL_DISPLAY = create_initialized_headless_egl_display()
File "/home/gridsan/geyang/mit/dmc_gen/custom_vendor/dm_control/dm_control/_render/pyopengl/egl_renderer.py", line 51, in create_initialized_headless_egl_display
devices = [devices[int(os.environ["CUDA_VISIBLE_DEVICES"])]]
ValueError: invalid literal for int() with base 10: 'GPU-a15dc796-f172-2e06-2283-cea8159bf118'
The reasoning behind using device UUIDs is explained in the following email (and issue):
FYI, this is a known error where `dm_control` assumes `CUDA_VISIBLE_DEVICES` is an integer. We’re using NVIDIA’s UUID API to set the device names to the UUID, rather than the default. The problem with the default naming scheme (0,1,etc) is that it is not consistent. What’s listed as GPU 0 might change even within a job, which you can imagine would cause major problems if you have two people on a node, each allocated one GPU. This sort of alludes to what I’m talking about, but doesn’t get into using the UUID’s instead: https://stackoverflow.com/questions/26123252/inconsistency-of-ids-between-nvidia-smi-l-and-cudevicegetname. It’s a big oversight on Ray’s part to assume that the GPU names are integers, both Tensorflow and Pytorch don’t seem to have a problem with it. I think what they need to understand is that in a shared environment you have to make sure people use only the GPU that’s been allocated to them, and the way to do that is to use the UUID.
I'm not totally sure what this has to do with `dm_control`, since we don't reference `CUDA_VISIBLE_DEVICES` anywhere in our code - is this a local modification you've made?
Hey Alistair! It is great to hear back from you!
When I request a single GPU via `gres=gpu:volta:1` on Slurm, only one device is available if we inspect via `nvidia-smi`, which is why I set `CUDA_VISIBLE_DEVICES=0` in my run script.
The same error actually arises on master of the `dm_control` code base; I was on an older version.
site-packages/dm_control/dm_control/_render/pyopengl/egl_renderer.py", line 52, in create_initialized_headless_egl_display
else:
device_idx = int(selected_device)
if not 0 <= device_idx < len(all_devices):
raise RuntimeError(
f'EGL_DEVICE_ID must be an integer between 0 and '
f'{len(all_devices) - 1} (inclusive), got {device_idx}.')
candidates = all_devices[device_idx:device_idx + 1]
The relevant code is here: https://github.com/deepmind/dm_control/blob/master/dm_control/_render/pyopengl/egl_renderer.py#L50
def create_initialized_headless_egl_display():
  """Creates an initialized EGL display directly on a device."""
  all_devices = EGL.eglQueryDevicesEXT()
  selected_device = os.environ.get('EGL_DEVICE_ID', None)
  if selected_device is None:
    candidates = all_devices
  else:
    device_idx = int(selected_device)
    if not 0 <= device_idx < len(all_devices):
      raise RuntimeError(
          f'EGL_DEVICE_ID must be an integer between 0 and '
          f'{len(all_devices) - 1} (inclusive), got {device_idx}.')
    candidates = all_devices[device_idx:device_idx + 1]
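To make the failure mode concrete, here is a minimal sketch of the same selection logic on a plain Python list, with the `int()` conversion guarded so that a UUID produces a descriptive error rather than a bare `ValueError`. The helper name `select_candidates` is hypothetical and not part of `dm_control`; the device list stands in for whatever `eglQueryDevicesEXT` returns.

```python
def select_candidates(all_devices, selected_device=None):
    """Pick candidate devices from a list, given an optional index string.

    Hypothetical helper for illustration only; it mirrors the index-based
    selection quoted above but does not touch EGL.
    """
    if selected_device is None:
        # No preference given: every device is a candidate.
        return list(all_devices)
    try:
        device_idx = int(selected_device)
    except ValueError:
        # A value such as 'GPU-a15dc796-...' lands here.
        raise RuntimeError(
            f'Expected an integer device index, got {selected_device!r}.'
        ) from None
    if not 0 <= device_idx < len(all_devices):
        raise RuntimeError(
            f'Device index must be between 0 and {len(all_devices) - 1} '
            f'(inclusive), got {device_idx}.')
    return all_devices[device_idx:device_idx + 1]


if __name__ == '__main__':
    print(select_candidates(['egl0', 'egl1'], '1'))
```

This only improves the error message; it does not by itself make UUID selection work.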
The reason given to us by the MIT SuperCloud admins is that integer device IDs can change during a job when a single node is shared between jobs, so they use non-integer device IDs instead. I have attached their response above.
This is not something we can change as users, and they seem to provide good reasons, so we are trying to figure out whether anything can be done to remove the requirement that device IDs be integers.
I see. Is there a reason you need to specify a particular device ID to use for rendering? The default behaviour is to use the first device that can be successfully initialised.
As you can see from the code linked above, we use `eglQueryDevicesEXT` to enumerate the available devices. I'm not aware of any API methods that would allow us to obtain a display device by UUID, so I'm not sure what we can do about this from our end.
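One possible user-side workaround, sketched under stated assumptions, is to translate a UUID into an integer index before `dm_control` is imported, using the output of `nvidia-smi --query-gpu=index,uuid --format=csv,noheader`. The helper names below are hypothetical, and the sketch assumes that the `nvidia-smi` index matches the position of the device in the EGL enumeration, which is not guaranteed and would need to be checked on the cluster in question.

```python
def parse_uuid_table(csv_text):
    """Parse 'index, uuid' lines into a {uuid: index} dict."""
    table = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, uuid = (field.strip() for field in line.split(','))
        table[uuid] = int(index)
    return table


def resolve_device_index(selector, table):
    """Return an integer index for either '0' or 'GPU-<uuid>'."""
    selector = selector.strip()
    if selector.isdigit():
        return int(selector)
    # Raises KeyError if the UUID is not in the table.
    return table[selector]


# Sample text in the shape produced by
#   nvidia-smi --query-gpu=index,uuid --format=csv,noheader
# The first UUID is the one from the traceback; the second is a made-up
# placeholder for illustration.
SAMPLE = """\
0, GPU-a15dc796-f172-2e06-2283-cea8159bf118
1, GPU-00000000-0000-0000-0000-000000000001
"""

if __name__ == '__main__':
    table = parse_uuid_table(SAMPLE)
    print(resolve_device_index('GPU-a15dc796-f172-2e06-2283-cea8159bf118', table))
```

The resulting index could then be exported as `EGL_DEVICE_ID` in the run script before Python starts. Since the default behaviour already tries every device and uses the first one that initialises, leaving `EGL_DEVICE_ID` unset may be simpler when only one GPU is visible to the job.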
Hi Alistair, let me investigate a bit and I will get back to you!