[2.5 release] GPU docker image failed to run mnist test
🐛 Bug
A newly built GPU docker image for PyTorch/XLA 2.5 (r2.5 branch) passes import torch_xla and passes PJRT_DEVICE=CPU python test/test_train_mp_mnist.py, but fails the mnist test with PJRT_DEVICE=CUDA: https://gist.github.com/ManfeiBai/f9efab9ce534970b7d9537006ff50a1a
- 8 GPUs:
  - cmd: GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
  - error: Failed to shut down the distributed runtime client. torch_xla/csrc/runtime/xla_coordinator.cc:48 : Check failed: dist_runtime_client_->Shutdown().ok()
- 1 GPU:
  - cmd: GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
  - error: RuntimeError: Bad StatusOr access: FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.
To Reproduce
Steps to reproduce the behavior:
- get a GPU
- create a new docker container from the testing GPU docker image us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc1_3.9_cuda_12.1:
  - cmd: sudo docker run --shm-size=16G --gpus all --name netnenewnewnewr25py39 --network host -it -d us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc1_3.9_cuda_12.1 bin/bash
- clone the PyTorch/XLA repo (r2.5 branch):
  - cmd: git clone -b r2.5 https://github.com/pytorch/xla.git
- change into the PyTorch/XLA repo directory:
  - cmd: cd xla
- run the mnist test with PJRT_DEVICE=CUDA:
  - cmd: GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
  - or: GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
  - or: GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py --num_epochs 2
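The three invocations above could be run in one pass with a small driver like the sketch below. It is not part of the PyTorch/XLA repo. The helper names (build_env, run_case, main) are hypothetical, and it assumes it is started from the root of the xla checkout inside the container. The last stderr line is only a rough hint and may not be the actual root cause.

```python
import os
import subprocess
import sys

# Configurations taken from the report: (GPU_NUM_DEVICES, extra script arguments).
CASES = [
    ("1", []),
    ("8", []),
    ("1", ["--num_epochs", "2"]),
]


def build_env(num_devices, device="CUDA"):
    """Return a copy of the current environment with the PJRT settings applied."""
    env = dict(os.environ)
    env["PJRT_DEVICE"] = device
    env["GPU_NUM_DEVICES"] = num_devices
    return env


def run_case(cmd, env):
    """Run one command and return (exit code, last non-empty stderr line)."""
    proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
    lines = [line for line in proc.stderr.splitlines() if line.strip()]
    return proc.returncode, (lines[-1] if lines else "")


def main(script="test/test_train_mp_mnist.py"):
    """Run every case and print a one-line summary per configuration."""
    for num_devices, extra in CASES:
        cmd = [sys.executable, script, *extra]
        code, last_err = run_case(cmd, build_env(num_devices))
        status = "ok" if code == 0 else f"failed (exit code {code})"
        print(f"GPU_NUM_DEVICES={num_devices} {' '.join(extra)}: {status} {last_err}")
```

Calling main() from the xla directory runs the cases one after another. Passing device="CPU" to build_env gives the CPU run that was reported as passing, for comparison.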
Environment
- Reproducible on XLA backend [CPU/TPU/CUDA]: CUDA
- torch_xla version: tag: v2.5.0-rc1
- GPU type: V100
- GCP info: IMAGE_FAMILY=pytorch-1-12-cu113
- GCP info: COUNT=4
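To double-check the versions listed above from inside the container, a short script along these lines may help. It is only a sketch: it assumes the installed distributions are named torch and torch_xla, and it reports "not installed" rather than failing when a package is missing. The function names are hypothetical.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or a placeholder."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_env(packages=("torch", "torch_xla")):
    """Gather the Python version and the versions of the given packages."""
    info = {"python": platform.python_version()}
    for name in packages:
        info[name] = package_version(name)
    return info


for key, value in collect_env().items():
    print(f"{key}: {value}")
```

The printed lines can be pasted into the Environment section as-is.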