dreamerv3
XlaRuntimeError: INTERNAL: RET_CHECK failure
Hi, I'm trying to run the code in Docker. Unfortunately, I get a JAX-related error:
UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: RET_CHECK failure
(external/xla/xla/service/gpu/gemm_algorithm_picker.cc:380)
stream->parent()->GetBlasGemmAlgorithms(stream, &algorithms)
Steps to reproduce:
Install the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
I changed line 33 in the Dockerfile to
COPY dreamerv3/embodied/scripts scripts
Create the Docker image and run the container:
docker build -f dreamerv3/Dockerfile -t dreamer-v3:$USER . && \
docker run -it --rm --gpus all -v ~/logdir:/logdir dreamer-v3:$USER \
sh ../scripts/xvfb_run.sh python3 dreamerv3/train.py \
--logdir "/logdir/$(date +%Y%m%d-%H%M%S)" \
--configs atari small --task atari_pong
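For reference, a quick sanity check that JAX inside the container detects the GPU at all (this is just a minimal check using the image built above, not one of the original steps; jax.devices() lists the accelerators JAX found):
docker run -it --rm --gpus all dreamer-v3:$USER \
python3 -c "import jax; print(jax.__version__); print(jax.devices())"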
My local nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01 Driver Version: 525.78.01 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 4000 Off | 00000000:01:00.0 On | N/A |
| 30% 30C P8 10W / 125W | 995MiB / 8192MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
The output of the provided NVIDIA Docker test:
docker run -it --rm --gpus all nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04 nvidia-smi
==========
== CUDA ==
==========
CUDA Version 11.4.2
Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
Tue May 2 12:15:16 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01 Driver Version: 525.78.01 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 4000 Off | 00000000:01:00.0 On | N/A |
| 30% 30C P8 11W / 125W | 995MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Do you think I have to change my local CUDA version in order to get DreamerV3 running correctly inside the container?
Hi, you can also use a Docker base image with a newer CUDA version. The algorithm supports the newest JAX/CUDA versions. Hope that helps!
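For example, a minimal sketch of that change would be swapping the base image at the top of the Dockerfile to a CUDA 12 tag and rebuilding (the exact tag below is an assumption; check which cudnn/Ubuntu variants are currently published on NVIDIA's registry):
FROM nvidia/cuda:12.0.1-cudnn8-devel-ubuntu22.04
docker build -f dreamerv3/Dockerfile -t dreamer-v3:$USER .
Note that if the CUDA major version in the base image changes, the jax/jaxlib install line in the Dockerfile needs to be updated to the matching CUDA wheels as well.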