dm_control
Noisy or unreadable images when rendering
Hi!
In some instances (embodied algos in my case) the new mujoco rendering gives unreadable images after a little while, e.g. here's a grid of 3 views of the same body:

This occurs after a little while, i.e. the first images rendered are perfectly fine. I tried to narrow it down to a minimal reproducible example but I can't find a way to do it (sorry about that!). When using the old bindings (mujoco-py and such) this issue disappears.
I'm using MUJOCO_GL=egl and have installed glew in my conda (working on a cluster where I have no sudo access).
I'm working with either G100 or A100 GPUs, and using them for both training and rendering. Also worth mentioning: I'm running a bunch of envs in parallel (not multithreaded but multiprocessing) for fast data collection.
Here is my conda env
# packages in environment at /fsx/users/vmoens/conda/envs/rl4:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
_tflow_select 2.3.0 mkl
absl-py 1.0.0 pyhd8ed1ab_0 conda-forge
aiohttp 3.8.1 py39hb9d737c_1 conda-forge
aiosignal 1.2.0 pyhd8ed1ab_0 conda-forge
ale-py 0.7.5 pypi_0 pypi
alsa-lib 1.2.6.1 h7f98852_0 conda-forge
anyio 3.6.1 pypi_0 pypi
aom 3.3.0 h27087fc_1 conda-forge
argon2-cffi 21.3.0 pyhd8ed1ab_0 conda-forge
argon2-cffi-bindings 21.2.0 py39hb9d737c_2 conda-forge
astor 0.8.1 pyh9f0ad1d_0 conda-forge
asttokens 2.0.5 pyhd8ed1ab_0 conda-forge
astunparse 1.6.3 pyhd8ed1ab_0 conda-forge
async-timeout 4.0.2 pyhd8ed1ab_0 conda-forge
atari-py 0.2.9 pypi_0 pypi
attr 2.5.1 h166bdaf_0 conda-forge
attrs 21.4.0 pyhd8ed1ab_0 conda-forge
autorom 0.4.2 pypi_0 pypi
autorom-accept-rom-license 0.4.2 pypi_0 pypi
babel 2.10.1 pypi_0 pypi
backcall 0.2.0 pyh9f0ad1d_0 conda-forge
backports 1.0 py_2 conda-forge
backports.functools_lru_cache 1.6.4 pyhd8ed1ab_0 conda-forge
beautifulsoup4 4.11.1 pyha770c72_0 conda-forge
blas 1.0 mkl
bleach 5.0.0 pyhd8ed1ab_0 conda-forge
blinker 1.4 py_1 conda-forge
brotlipy 0.7.0 py39hb9d737c_1004 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.18.1 h7f98852_0 conda-forge
ca-certificates 2022.5.18.1 ha878542_0 conda-forge
cachetools 5.2.0 pypi_0 pypi
certifi 2022.5.18.1 pypi_0 pypi
cffi 1.15.0 py39h4bc2ebd_0 conda-forge
charset-normalizer 2.0.12 pyhd8ed1ab_0 conda-forge
click 8.1.3 py39hf3d152e_0 conda-forge
cloudpickle 1.2.2 pypi_0 pypi
configargparse 1.5.3 pypi_0 pypi
cryptography 37.0.1 py39h9ce1e76_0
cudatoolkit 11.3.1 h2bc3f7f_2
cycler 0.11.0 pypi_0 pypi
cython 0.29.30 pypi_0 pypi
dbus 1.13.6 h5008d03_3 conda-forge
debugpy 1.6.0 py39h5a03fae_0 conda-forge
decorator 4.4.2 pypi_0 pypi
defusedxml 0.7.1 pyhd8ed1ab_0 conda-forge
dm-control 1.0.3.post1 pypi_0 pypi
dm-env 1.5 pypi_0 pypi
dm-tree 0.1.7 pypi_0 pypi
elfutils 0.186 he364ef2_0 conda-forge
entrypoints 0.4 pyhd8ed1ab_0 conda-forge
executing 0.8.3 pyhd8ed1ab_0 conda-forge
expat 2.4.8 h27087fc_0 conda-forge
fasteners 0.17.3 pypi_0 pypi
ffmpeg 5.0.1 h964e5f1_2 conda-forge
fftw 3.3.8 nompi_hfc0cae8_1114 conda-forge
flatten-dict 0.4.2 pypi_0 pypi
flit-core 3.7.1 pyhd8ed1ab_0 conda-forge
font-ttf-dejavu-sans-mono 2.37 hab24e00_0 conda-forge
font-ttf-inconsolata 3.000 h77eed37_0 conda-forge
font-ttf-source-code-pro 2.038 h77eed37_0 conda-forge
font-ttf-ubuntu 0.83 hab24e00_0 conda-forge
fontconfig 2.14.0 h8e229c2_0 conda-forge
fonts-conda-ecosystem 1 0 conda-forge
fonts-conda-forge 1 0 conda-forge
fonttools 4.33.3 pypi_0 pypi
freetype 2.10.4 h0708190_1 conda-forge
freetype-py 2.3.0 pypi_0 pypi
frozenlist 1.3.0 py39hb9d737c_1 conda-forge
functorch 0.3.0a0+693bcee pypi_0 pypi
gast 0.4.0 pyh9f0ad1d_0 conda-forge
gettext 0.19.8.1 h73d1719_1008 conda-forge
giflib 5.2.1 h36c2ea0_2 conda-forge
glew 2.1.0 h9c3ff4c_2 conda-forge
glew-osmesa 1.13.0.20151117 0 menpo
glfw 2.5.3 pypi_0 pypi
glfw3 3.2.1 0 menpo
gmp 6.2.1 h58526e2_0 conda-forge
gnutls 3.6.13 h85f3911_1 conda-forge
google-auth 2.6.6 pypi_0 pypi
google-auth-oauthlib 0.4.6 pyhd8ed1ab_0 conda-forge
google-pasta 0.2.0 pyh8c360ce_0 conda-forge
grpcio 1.46.3 py39h0f497a6_0 conda-forge
gst-plugins-base 1.20.2 hf6a322e_1 conda-forge
gstreamer 1.20.2 hd4edc92_1 conda-forge
gym 0.24.1 pypi_0 pypi
gym-notices 0.0.6 pypi_0 pypi
h5py 2.10.0 nompi_py39h98ba4bc_106 conda-forge
hdf5 1.10.6 nompi_h3c11f04_101 conda-forge
icu 70.1 h27087fc_0 conda-forge
idna 3.3 pyhd8ed1ab_0 conda-forge
imageio 2.19.3 pypi_0 pypi
imageio-ffmpeg 0.4.7 pypi_0 pypi
importlib-metadata 4.11.4 py39hf3d152e_0 conda-forge
importlib_resources 5.7.1 pyhd8ed1ab_1 conda-forge
iniconfig 1.1.1 pypi_0 pypi
intel-openmp 2021.4.0 h06a4308_3561
ipykernel 6.13.0 pypi_0 pypi
ipython 8.4.0 py39hf3d152e_0 conda-forge
ipython-genutils 0.2.0 pypi_0 pypi
ipython_genutils 0.2.0 py_1 conda-forge
ipywidgets 7.7.0 pyhd8ed1ab_0 conda-forge
jack 1.9.18 h8c3723f_1002 conda-forge
jedi 0.18.1 py39hf3d152e_1 conda-forge
jinja2 3.1.2 pyhd8ed1ab_1 conda-forge
jpeg 9e h166bdaf_1 conda-forge
json-c 0.16 hc379101_0 conda-forge
json5 0.9.8 pypi_0 pypi
jsonschema 4.5.1 pypi_0 pypi
jupyter 1.0.0 py39hf3d152e_7 conda-forge
jupyter-client 7.3.1 pypi_0 pypi
jupyter-server 1.17.0 pypi_0 pypi
jupyter_client 7.3.4 pyhd8ed1ab_0 conda-forge
jupyter_console 6.4.3 pyhd8ed1ab_0 conda-forge
jupyter_core 4.10.0 py39hf3d152e_0 conda-forge
jupyterlab 3.4.2 pypi_0 pypi
jupyterlab-server 2.14.0 pypi_0 pypi
jupyterlab_pygments 0.2.2 pyhd8ed1ab_0 conda-forge
jupyterlab_widgets 1.1.0 pyhd8ed1ab_0 conda-forge
keras-preprocessing 1.1.2 pyhd8ed1ab_0 conda-forge
keyutils 1.6.1 h166bdaf_0 conda-forge
kiwisolver 1.4.2 pypi_0 pypi
krb5 1.19.3 h3790be6_0 conda-forge
labmaze 1.0.5 pypi_0 pypi
lame 3.100 h7f98852_1001 conda-forge
lcms2 2.12 hddcbb42_0 conda-forge
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
lerc 3.0 h9c3ff4c_0 conda-forge
libarchive 3.5.2 hb890918_2 conda-forge
libcap 2.64 ha37c62d_0 conda-forge
libclang 14.0.4 default_h2e3cab8_0 conda-forge
libclang13 14.0.4 default_h3a83d3e_0 conda-forge
libcups 2.3.3 hf5a7f15_1 conda-forge
libcurl 7.83.1 h7bff187_0 conda-forge
libdb 6.2.32 h9c3ff4c_0 conda-forge
libdeflate 1.10 h7f98852_0 conda-forge
libdrm 2.4.109 h7f98852_0 conda-forge
libdrm-cos6-x86_64 2.4.65 4 anaconda
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libevent 2.1.10 h9b69904_4 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libflac 1.3.4 h27087fc_0 conda-forge
libgcc-ng 12.1.0 h8d9b700_16 conda-forge
libgfortran-ng 7.5.0 h14aa051_20 conda-forge
libgfortran4 7.5.0 h14aa051_20 conda-forge
libglib 2.70.2 h174f98d_4 conda-forge
libglu 9.0.0 he1b5a44_1001 conda-forge
libgomp 12.1.0 h8d9b700_16 conda-forge
libiconv 1.16 h516909a_0 conda-forge
libllvm14 14.0.4 he0ac6c6_0 conda-forge
libmicrohttpd 0.9.75 h7f98852_0 conda-forge
libnghttp2 1.47.0 h727a467_0 conda-forge
libnsl 2.0.0 h7f98852_0 conda-forge
libogg 1.3.4 h7f98852_1 conda-forge
libopus 1.3.1 h7f98852_1 conda-forge
libpciaccess 0.16 h516909a_0 conda-forge
libpng 1.6.37 h21135ba_2 conda-forge
libpq 14.3 hd77ab85_0 conda-forge
libprotobuf 3.20.1 h6239696_0 conda-forge
libsndfile 1.0.31 h9c3ff4c_1 conda-forge
libsodium 1.0.18 h36c2ea0_1 conda-forge
libssh2 1.10.0 ha56f1ee_2 conda-forge
libstdcxx-ng 12.1.0 ha89aaad_16 conda-forge
libtiff 4.3.0 h0fcbabc_4 conda-forge
libtool 2.4.6 h9c3ff4c_1008 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libva 2.14.0 h7f98852_0 conda-forge
libvorbis 1.3.7 h9c3ff4c_0 conda-forge
libvpx 1.11.0 h9c3ff4c_3 conda-forge
libwebp 1.2.2 h3452ae3_0 conda-forge
libwebp-base 1.2.2 h7f98852_1 conda-forge
libx11-common-cos6-x86_64 1.6.4 4 anaconda
libx11-cos6-x86_64 1.6.4 4 anaconda
libxcb 1.13 h7f98852_1004 conda-forge
libxkbcommon 1.0.3 he3ba5ed_0 conda-forge
libxml2 2.9.14 h22db469_0 conda-forge
libzlib 1.2.12 h166bdaf_0 conda-forge
lockfile 0.12.2 pypi_0 pypi
lxml 4.8.0 pypi_0 pypi
lz4-c 1.9.3 h9c3ff4c_1 conda-forge
lzo 2.10 h516909a_1000 conda-forge
markdown 3.3.7 pyhd8ed1ab_0 conda-forge
markupsafe 2.1.1 py39hb9d737c_1 conda-forge
matplotlib 3.5.2 pypi_0 pypi
matplotlib-inline 0.1.3 pyhd8ed1ab_0 conda-forge
mesa-libgl-cos6-x86_64 11.0.7 4 anaconda
mesalib 21.2.5 h0e4506f_3 conda-forge
mistune 0.8.4 py39h3811e60_1005 conda-forge
mj-envs 1.0.0 dev_0 <develop>
mjrl 0.1.1 dev_0 <develop>
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py39h7e14d7c_0 conda-forge
mkl_fft 1.3.1 py39h0c7bc48_1 conda-forge
mkl_random 1.2.2 py39hde0f152_0 conda-forge
moviepy 1.0.3 pypi_0 pypi
mujoco 2.2.0 pypi_0 pypi
mujoco-py 2.0.2.2 dev_0 <develop>
multidict 6.0.2 py39hb9d737c_1 conda-forge
mysql-common 8.0.29 haf5c9bc_1 conda-forge
mysql-libs 8.0.29 h28c427c_1 conda-forge
nbclassic 0.3.7 pypi_0 pypi
nbclient 0.6.3 pypi_0 pypi
nbconvert 6.5.0 pyhd8ed1ab_0 conda-forge
nbconvert-core 6.5.0 pyhd8ed1ab_0 conda-forge
nbconvert-pandoc 6.5.0 pyhd8ed1ab_0 conda-forge
nbformat 5.4.0 pyhd8ed1ab_0 conda-forge
ncurses 6.3 h27087fc_1 conda-forge
nest-asyncio 1.5.5 pyhd8ed1ab_0 conda-forge
nettle 3.6 he412f7d_0 conda-forge
networkx 2.8.3 pypi_0 pypi
notebook 6.4.11 pypi_0 pypi
notebook-shim 0.1.0 pypi_0 pypi
nspr 4.32 h9c3ff4c_1 conda-forge
nss 3.78 h2350873_0 conda-forge
numpy 1.22.4 pypi_0 pypi
numpy-base 1.22.3 py39hf524024_0
oauthlib 3.2.0 pyhd8ed1ab_0 conda-forge
openh264 2.1.1 h780b84a_0 conda-forge
openjpeg 2.4.0 hb52868f_1 conda-forge
openssl 1.1.1o h166bdaf_0 conda-forge
opt_einsum 3.3.0 pyhd8ed1ab_1 conda-forge
osmesa 12.2.2.dev 0 menpo
packaging 21.3 pyhd8ed1ab_0 conda-forge
pandas 1.4.2 pypi_0 pypi
pandoc 2.18 ha770c72_0 conda-forge
pandocfilters 1.5.0 pyhd8ed1ab_0 conda-forge
parso 0.8.3 pyhd8ed1ab_0 conda-forge
patchelf 0.14.5.0 pypi_0 pypi
pcre 8.45 h9c3ff4c_0 conda-forge
pexpect 4.8.0 pyh9f0ad1d_2 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 9.1.1 py39hae2aec6_0 conda-forge
pip 22.1.1 pyhd8ed1ab_0 conda-forge
pluggy 1.0.0 pypi_0 pypi
portaudio 19.6.0 h57a0ea0_5 conda-forge
proglog 0.1.10 pypi_0 pypi
prometheus_client 0.14.1 pyhd8ed1ab_0 conda-forge
prompt-toolkit 3.0.29 pyha770c72_0 conda-forge
prompt_toolkit 3.0.29 hd8ed1ab_0 conda-forge
protobuf 3.19.4 pypi_0 pypi
psutil 5.9.1 py39hb9d737c_0 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pulseaudio 14.0 h583eb01_5 conda-forge
pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge
py 1.11.0 pypi_0 pypi
pyasn1 0.4.8 py_0 conda-forge
pyasn1-modules 0.2.8 pypi_0 pypi
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pygame 2.1.2 pypi_0 pypi
pyglet 1.5.26 pypi_0 pypi
pygments 2.12.0 pyhd8ed1ab_0 conda-forge
pyjwt 2.4.0 pyhd8ed1ab_0 conda-forge
pyopengl 3.1.6 pypi_0 pypi
pyopenssl 22.0.0 pyhd8ed1ab_0 conda-forge
pyparsing 2.4.7 pypi_0 pypi
pyqt 5.15.4 py39h18e9c17_1 conda-forge
pyqt5-sip 12.9.0 py39h5a03fae_1 conda-forge
pyrender 0.1.45 pypi_0 pypi
pyrsistent 0.18.1 py39hb9d737c_1 conda-forge
pysocks 1.7.1 py39hf3d152e_5 conda-forge
pytest 7.1.2 pypi_0 pypi
python 3.9.13 h9a8a25e_0_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python-fastjsonschema 2.15.3 pyhd8ed1ab_0 conda-forge
python-flatbuffers 2.0 pyhd8ed1ab_0 conda-forge
python_abi 3.9 2_cp39 conda-forge
pytorch 1.13.0.dev20220531 py3.9_cuda11.3_cudnn8.3.2_0 pytorch-nightly
pytorch-mutex 1.0 cuda pytorch-nightly
pytz 2022.1 pypi_0 pypi
pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge
pyyaml 6.0 pypi_0 pypi
pyzmq 23.0.0 pypi_0 pypi
qt-main 5.15.4 ha5833f6_1 conda-forge
qtconsole 5.3.1 pyhd8ed1ab_0 conda-forge
qtconsole-base 5.3.1 pyha770c72_0 conda-forge
qtpy 2.1.0 pyhd8ed1ab_0 conda-forge
readline 8.1 h46c0cb4_0 conda-forge
requests 2.27.1 pyhd8ed1ab_0 conda-forge
requests-oauthlib 1.3.1 pyhd8ed1ab_0 conda-forge
rsa 4.8 pyhd8ed1ab_0 conda-forge
scipy 1.8.1 pypi_0 pypi
send2trash 1.8.0 pyhd8ed1ab_0 conda-forge
setuptools 62.3.2 py39hf3d152e_0 conda-forge
sip 6.5.1 py39he80948d_2 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
sk-video 1.1.10 pypi_0 pypi
sniffio 1.2.0 pypi_0 pypi
soupsieve 2.3.2.post1 pypi_0 pypi
sqlite 3.38.5 h4ff8645_0 conda-forge
stack_data 0.2.0 pyhd8ed1ab_0 conda-forge
submitit 1.4.2 pypi_0 pypi
svt-av1 1.1.0 h27087fc_1 conda-forge
tensorboard 2.9.1 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pyhd8ed1ab_0 conda-forge
tensorflow 2.4.1 mkl_py39h4683426_0
tensorflow-base 2.4.1 mkl_py39h43e0292_0
tensorflow-estimator 2.6.0 py39he80948d_0 conda-forge
termcolor 1.1.0 py_2 conda-forge
terminado 0.15.0 py39hf3d152e_0 conda-forge
tinycss2 1.1.1 pyhd8ed1ab_0 conda-forge
tk 8.6.12 h27826a3_0 conda-forge
toml 0.10.2 pyhd8ed1ab_0 conda-forge
tomli 2.0.1 pypi_0 pypi
torch-tb-profiler 0.4.0 pypi_0 pypi
torchaudio 0.12.0.dev20220531 py39_cu113 pytorch-nightly
torchrl 0.1 dev_0 <develop>
torchvision 0.14.0.dev20220531 py39_cu113 pytorch-nightly
tornado 6.1 py39hb9d737c_3 conda-forge
tqdm 4.64.0 pypi_0 pypi
traitlets 5.2.1.post0 pypi_0 pypi
trimesh 3.12.6 pypi_0 pypi
typing-extensions 4.2.0 hd8ed1ab_1 conda-forge
typing_extensions 4.2.0 pyha770c72_1 conda-forge
tzdata 2022a h191b570_0 conda-forge
urllib3 1.26.9 pyhd8ed1ab_0 conda-forge
wcwidth 0.2.5 pyh9f0ad1d_2 conda-forge
webencodings 0.5.1 pypi_0 pypi
websocket-client 1.3.2 pypi_0 pypi
werkzeug 2.1.2 pyhd8ed1ab_1 conda-forge
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
widgetsnbextension 3.6.0 py39hf3d152e_0 conda-forge
wrapt 1.14.1 py39hb9d737c_0 conda-forge
x264 1!161.3030 h7f98852_1 conda-forge
x265 3.5 h924138e_3 conda-forge
xorg-damageproto 1.2.1 h7f98852_1002 conda-forge
xorg-fixesproto 5.0 h7f98852_1002 conda-forge
xorg-glproto 1.4.17 h7f98852_1002 conda-forge
xorg-kbproto 1.0.7 h7f98852_1002 conda-forge
xorg-libx11 1.7.2 h7f98852_0 conda-forge
xorg-libxau 1.0.9 h7f98852_0 conda-forge
xorg-libxcursor 1.2.0 h7f98852_0 conda-forge
xorg-libxdamage 1.1.5 h7f98852_1 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xorg-libxext 1.3.4 h7f98852_1 conda-forge
xorg-libxfixes 5.0.3 h7f98852_1004 conda-forge
xorg-libxinerama 1.1.4 h9c3ff4c_1001 conda-forge
xorg-libxrandr 1.5.2 h7f98852_1 conda-forge
xorg-libxrender 0.9.10 h7f98852_1003 conda-forge
xorg-randrproto 1.5.0 h7f98852_1001 conda-forge
xorg-renderproto 0.11.1 h7f98852_1002 conda-forge
xorg-util-macros 1.19.3 h7f98852_0 conda-forge
xorg-xextproto 7.3.0 h7f98852_1002 conda-forge
xorg-xf86vidmodeproto 2.3.1 h7f98852_1002 conda-forge
xorg-xproto 7.0.31 h7f98852_1007 conda-forge
xz 5.2.5 h516909a_1 conda-forge
yarl 1.7.2 py39hb9d737c_2 conda-forge
zeromq 4.3.4 h9c3ff4c_1 conda-forge
zipp 3.8.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.12 h166bdaf_0 conda-forge
zstd 1.5.2 h8a70e8d_1 conda-forge
I have a similar problem. Sometimes it renders images correctly, but sometimes it renders only the background image (see the video). This issue is non-deterministic, and the video might be rendered correctly or incorrectly for the same seed.
OS: Ubuntu 20.04, MuJoCo version: 2.2.0. I use MUJOCO_GL=egl as well.
Can you please try running with DISABLE_RENDER_THREAD_OFFLOADING=1 (environment variable)?
I still get the same behaviour with DISABLE_RENDER_THREAD_OFFLOADING=1 :/
@saran-t DISABLE_RENDER_THREAD_OFFLOADING=1 doesn't resolve the problem for me either.
Can I please have a minimal repro code that I can run on my side?
@vmoens @ikostrikov Gentle nudge on the request for minimal repro above. We'd like to try to get to the bottom of this.
Hi @saran-t, I've been trying hard to reproduce this but it seems to only happen after the code reaches a certain level of complexity (e.g. GPUs are used for training and rendering, etc.). Would it be ok if I point you to a specific commit on torchrl and give you the precise conda env setting, the machine config, etc. for you to reproduce? It's going to be a bit messy but at least it's something!
If it's consistently reproducible, a messy repro case will be better than not having one at all, so please do give us that anyway.
Also, are you saying with like-for-like experiment complexity level, mujoco-py rendering does not break in the same way?
Here's one 0e88eac27f1d01bfa1d260d52c051ab5fe514859
Here's the command line
conda create -n mbrl_dmcontrol3 python=3.10
conda activate mbrl_dmcontrol3
pip install dm_control
module load cuda/11.6 nccl/2.12.7-cuda.11.6 nccl_efa/1.15.1-nccl.2.12.7-cuda.11.6
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install functorch
pip install hydra-core
# from torchrl root:
python setup.py develop
cd examples/dreamer/
EGL_DEVICE_ID=2 MUJOCO_GL=egl CHECK_IMAGES=1 srun -p train --gpus-per-node 3 -c 32 python dreamer.py frame_skip=2 init_env_steps=10000 logger=csv
Setting CHECK_IMAGES=1 makes sure an error is raised as soon as an image is more than half black or white (i.e. the render has collapsed).
You should see an error like this during the first test rollout:
Traceback (most recent call last):
File "/fsx/users/vmoens/work/rl_mb/examples/dreamer/dreamer.py", line 411, in main
call_record(logger, record, collected_frames, sampled_tensordict_save, stats, model_based_env, actor_model, cfg)
File "/fsx/users/vmoens/conda/envs/mbrl_dmcontrol3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/fsx/users/vmoens/work/rl_mb/examples/dreamer/dreamer.py", line 132, in call_record
td_record = record(None)
File "/fsx/users/vmoens/conda/envs/mbrl_dmcontrol3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/fsx/users/vmoens/work/rl_mb/torchrl/trainers/trainers.py", line 907, in __call__
td_record = self.recorder.rollout(
File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/common.py", line 503, in rollout
tensordict = self.reset()
File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/common.py", line 346, in reset
tensordict_reset = self._reset(tensordict, **kwargs)
File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/transforms/transforms.py", line 403, in _reset
out_tensordict = self.base_env.reset(execute_step=False, **kwargs)
File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/common.py", line 346, in reset
tensordict_reset = self._reset(tensordict, **kwargs)
File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/gym_like.py", line 122, in _reset
source=self._read_obs(obs),
File "/fsx/users/vmoens/work/rl_mb/torchrl/envs/gym_like.py", line 136, in _read_obs
observations = self.observation_spec.encode(observations)
File "/fsx/users/vmoens/work/rl_mb/torchrl/data/tensor_specs.py", line 1107, in encode
out[key] = self[key].encode(item)
File "/fsx/users/vmoens/work/rl_mb/torchrl/data/tensor_specs.py", line 243, in encode
assert v < 0.5, f"numpy: {val.shape}"
AssertionError: numpy: (240, 320, 3)
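For readers following along, a check of the kind CHECK_IMAGES=1 enables could look roughly like the sketch below. This is not torchrl's actual implementation: the function names, the use of a raw uint8 byte buffer (e.g. what `frame.tobytes()` would give for a NumPy frame) and the reading of "more than half black or white" as "at least half of the bytes are 0, or at least half are 255" are all assumptions made for illustration.

```python
def saturated_fraction(pixels: bytes) -> float:
    """Return the larger of the fraction of 0 bytes and the fraction of 255 bytes.

    `pixels` is assumed to be the raw uint8 buffer of a rendered frame.
    """
    if not pixels:
        raise ValueError("empty frame")
    # bytes.count accepts an integer in range(256) and counts matching bytes.
    return max(pixels.count(0), pixels.count(255)) / len(pixels)


def check_frame(pixels: bytes, threshold: float = 0.5) -> None:
    """Raise AssertionError if the frame looks collapsed (hypothetical helper)."""
    fraction = saturated_fraction(pixels)
    if fraction >= threshold:
        raise AssertionError(
            f"frame looks collapsed: {fraction:.0%} of bytes are saturated"
        )
```

A check like this only catches fully black or white collapse; the noisy images reported at the top of the thread would likely need a different heuristic.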
Please point me to where the rendering context is set up and where the multiprocessing occurs.
If it's consistently reproducible, a messy repro case will be better than not having one at all, so please do give us that anyway.
got it!
Also, are you saying with like-for-like experiment complexity level, mujoco-py rendering does not break in the same way?
let me rephrase: with one library where we used to rely on mujoco-py but switched to the new mujoco bindings, we have seen this issue appearing. I ran the following experiment using an old version of dm_control with torchrl and the issue disappears. Here's the setup
torchrl commit: 056699bd214937400c5cc7722669e7819a93bc1e
Setup:
conda create -n mbrl_olddmc python=3.9
conda activate mbrl_olddmc
pip install mujoco_py
pip install dm-control==0.0.403778684 # works with mujoco 210
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install functorch
pip install hydra-core
cd path/to/torchrl
python setup.py develop
conda env config vars set MJLIB_PATH=/data/home/vmoens/.mujoco/mujoco210/bin/libmujoco210.so LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/home/vmoens/.mujoco/mujoco210/bin MUJOCO_GL=egl PYOPENGL_PLATFORM=egl MUJOCO_PY_MUJOCO_PATH=/data/home/vmoens/.mujoco/mujoco210
conda deactivate && conda activate mbrl_olddmc
Command:
EGL_DEVICE_ID=2 MUJOCO_GL=egl CHECK_IMAGES=1 srun -p train --gpus-per-node 3 -c 32 python dreamer.py frame_skip=2 init_env_steps=10000 logger=csv env_per_collector=1 num_workers=1
Importantly:
- this code immediately fails on the env where I have installed the new dm_control version (with new bindings) but not with the old one
- this setup will not use any parallelism for rendering (a single process is launched), see env_per_collector=1 num_workers=1 async_collection=False which tell our trainer to collect data on the same process where the training occurs.
- On the same cluster (different nodes but same setup) the error occurs consistently at the same time.
For rendering, we use the dm_control pixels wrapper. When executing a step we create a torch.Tensor from the numpy array and send it to the device if needed.
In the example script I gave above, we first run a random rollout in the environment to get statistics about the observation. To do that, we have a function that creates an environment instance, runs the rollout and calculates the stats. Then we run another random rollout to get data to pass to the model (to initialize it): we have lazy layers that take the right shape once they see real data. In this example, that's where the issue happens (not even during training).
I'm having trouble running python dreamer.py frame_skip=2 init_env_steps=10000 logger=csv on my machine.
Could you please make a repro script that just runs the dm_control environment without any agent in the loop, preferably without any dependency on Torch?
Note also that I don't have access to a SLURM cluster and I need to repro this on a local machine.
OK, I have this running. I have zero familiarity with this code, but it seems that Hydra is creating some sort of default cfg and is forcing cfg.collector_devices to be ['cuda:1', 'cuda:1']. On my machine with only a single GPU, this causes an "invalid ordinal" CUDA error.
I had to go into torchrl/trainers/helpers/envs.py and manually override device to 'cuda:0' which allows the script to run. However, now everything runs just fine and I cannot actually trigger the error.
Let me write a single-gpu example for you
I've managed to trigger the error. Still investigating, but it looks like something is copying the rendering context objects in Python, which isn't a supported operation.
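To illustrate why copying an object that owns a native rendering handle is hazardous, here is a minimal, purely hypothetical sketch. `FakeRenderContext` and `Env` are invented stand-ins and do not correspond to dm_control classes. It shows one generic way to guard against the problem: the context refuses to be copied, and the owning environment's deep copy builds a fresh context lazily instead of aliasing the old one. This is not a description of the actual fix.

```python
import copy


class FakeRenderContext:
    """Hypothetical stand-in for an object that owns a native rendering handle."""

    _next_handle = 0

    def __init__(self):
        # Pretend this integer identifies a GPU-side resource.
        FakeRenderContext._next_handle += 1
        self.handle = FakeRenderContext._next_handle

    # A copied Python object would alias the same native handle, so two
    # owners could use or free it independently. Refuse copies outright.
    def __copy__(self):
        raise TypeError("FakeRenderContext cannot be copied")

    def __deepcopy__(self, memo):
        raise TypeError("FakeRenderContext cannot be copied")


class Env:
    """Hypothetical environment that lazily creates its own context."""

    def __init__(self):
        self._context = None

    @property
    def context(self):
        if self._context is None:
            self._context = FakeRenderContext()
        return self._context

    def __deepcopy__(self, memo):
        # Copy the environment but leave the context unset, so the clone
        # creates its own context on first use.
        clone = type(self)()
        memo[id(self)] = clone
        return clone
```

With this pattern, `copy.deepcopy(env)` yields an environment whose `context.handle` differs from the original's, while `copy.deepcopy(env.context)` raises `TypeError` instead of silently duplicating the handle.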
@vmoens Could you please try https://github.com/saran-t/dm_control/pull/1 and see if it fixes your issue?
It is running in a much more stable way than it used to. No noisy pixel, and runs that used to collapse after a couple of iterations are now running smoothly. For me this can be considered as closed. Thanks so much for your help @saran-t! This is amazing
I'll have this fixed in our 1.0.6 release later this week.
This should now be fixed in version 1.0.6.