dreamerv3
XlaRuntimeError: INTERNAL: RET_CHECK failure
Hi, I'm trying to run the code in Docker. Unfortunately, I get a JAX-related error:
UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: RET_CHECK failure
(external/xla/xla/service/gpu/gemm_algorithm_picker.cc:380)
stream->parent()->GetBlasGemmAlgorithms(stream, &algorithms)
Steps to reproduce:
Install the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
I changed line 33 in the Dockerfile to
COPY dreamerv3/embodied/scripts scripts
Create the Docker image and run the container:
docker build -f dreamerv3/Dockerfile -t dreamer-v3:$USER . && \
docker run -it --rm --gpus all -v ~/logdir:/logdir dreamer-v3:$USER \
sh ../scripts/xvfb_run.sh python3 dreamerv3/train.py \
--logdir "/logdir/$(date +%Y%m%d-%H%M%S)" \
--configs atari small --task atari_pong
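For reference, a quick sanity check that JAX inside the container detects the GPU at all (this is just a minimal check using the image built above, not one of the original steps; jax.devices() lists the accelerators JAX found):
docker run -it --rm --gpus all dreamer-v3:$USER \
python3 -c "import jax; print(jax.__version__); print(jax.devices())"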
My local nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01 Driver Version: 525.78.01 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 4000 Off | 00000000:01:00.0 On | N/A |
| 30% 30C P8 10W / 125W | 995MiB / 8192MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
The output of the provided NVIDIA Docker test:
docker run -it --rm --gpus all nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04 nvidia-smi
==========
== CUDA ==
==========
CUDA Version 11.4.2
Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md
Tue May 2 12:15:16 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01 Driver Version: 525.78.01 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 4000 Off | 00000000:01:00.0 On | N/A |
| 30% 30C P8 11W / 125W | 995MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Do you think I have to change my local CUDA version in order to get DreamerV3 running correctly inside the container?
Hi, you can also use a Docker base image with a newer CUDA version. The algorithm supports the newest JAX/CUDA versions. Hope that helps!
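For example, a minimal sketch of that change would be swapping the base image at the top of the Dockerfile to a CUDA 12 tag and rebuilding (the exact tag below is an assumption; check which cudnn/Ubuntu variants are currently published on NVIDIA's registry):
FROM nvidia/cuda:12.0.1-cudnn8-devel-ubuntu22.04
docker build -f dreamerv3/Dockerfile -t dreamer-v3:$USER .
Note that if the CUDA major version in the base image changes, the jax/jaxlib install line in the Dockerfile needs to be updated to the matching CUDA wheels as well.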