xla [2.5 release] GPU docker image failed to run mnist test

Open ManfeiBai opened this issue 4 months ago • 2 comments

🐛 Bug

A newly built GPU docker image for PyTorch/XLA 2.5 (r2.5 branch) passes import torch_xla and passes PJRT_DEVICE=CPU python test/test_train_mp_mnist.py, but fails the mnist test with PJRT_DEVICE=CUDA. Full log: https://gist.github.com/ManfeiBai/f9efab9ce534970b7d9537006ff50a1a

8 GPU:
- cmd: GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
- error: Failed to shut down the distributed runtime client. torch_xla/csrc/runtime/xla_coordinator.cc:48 : Check failed: dist_runtime_client_->Shutdown().ok()
1 GPU:
- cmd: GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
- error: RuntimeError: Bad StatusOr access: FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.
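When triaging a captured log, the two error messages above can be used as signatures to tell the failures apart. The following is a minimal sketch, not part of PyTorch/XLA: the labels `coordinator_shutdown` and `dnn_init` and the helper `classify_failure` are hypothetical names chosen for illustration, and only the matched substrings come from this report.

```python
from typing import Optional

# Substrings copied from the two error messages quoted in this report.
# The labels (dictionary keys) are hypothetical names chosen for this sketch.
KNOWN_SIGNATURES = {
    "coordinator_shutdown": "Check failed: dist_runtime_client_->Shutdown().ok()",
    "dnn_init": "DNN library initialization failed",
}


def classify_failure(log_text: str) -> Optional[str]:
    """Return the label of the first known signature found in log_text, else None."""
    for label, needle in KNOWN_SIGNATURES.items():
        if needle in log_text:
            return label
    return None
```

Passing the captured stderr of a run to `classify_failure` gives a quick indication of which of the two failures occurred; a `None` result means neither signature was present.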

To Reproduce

Steps to reproduce the behavior:

Get a machine with a GPU.
Create a new docker container from the testing GPU docker image us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc1_3.9_cuda_12.1, running bin/bash:

cmd: sudo docker run --shm-size=16G --gpus all --name netnenewnewnewr25py39 --network host -it -d us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc1_3.9_cuda_12.1 bin/bash

Clone the PyTorch/XLA repo (r2.5 branch):

cmd: git clone -b r2.5 https://github.com/pytorch/xla.git

Change into the PyTorch/XLA repo directory:

cmd: cd xla

Run the mnist test with PJRT_DEVICE=CUDA, using any of the following:

cmd: GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
cmd: GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
cmd: GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py --num_epochs 2
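To run all three command variants from the last step in one go, a small driver along these lines may help. It is a sketch under assumptions: it is meant to be started from the root of the xla checkout inside the container, and the names `VARIANTS`, `build_env`, `build_command`, and `run_all` are hypothetical helpers, not part of the repo. Only the environment variables, the script path, and the `--num_epochs 2` flag are taken from the commands above.

```python
import os
import subprocess
import sys

TEST_SCRIPT = "test/test_train_mp_mnist.py"  # path relative to the xla checkout

# (GPU_NUM_DEVICES value, extra script arguments) for the three variants above.
VARIANTS = [
    ("1", []),
    ("8", []),
    ("1", ["--num_epochs", "2"]),
]


def build_env(gpu_count, base_env=None):
    """Copy the environment and set the two variables used in the commands above."""
    env = dict(os.environ if base_env is None else base_env)
    env["PJRT_DEVICE"] = "CUDA"
    env["GPU_NUM_DEVICES"] = gpu_count
    return env


def build_command(extra_args):
    """Build the argv list for one variant, using the current Python interpreter."""
    return [sys.executable, TEST_SCRIPT, *extra_args]


def run_all():
    """Run each variant in turn and return a list of (command, exit code) pairs."""
    results = []
    for gpu_count, extra_args in VARIANTS:
        cmd = build_command(extra_args)
        proc = subprocess.run(cmd, env=build_env(gpu_count))
        results.append((cmd, proc.returncode))
    return results
```

Calling `run_all()` runs the variants sequentially and does not stop at the first failure, so the exit codes of all three can be compared afterwards.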

Environment

Reproducible on XLA backend [CPU/TPU/CUDA]: CUDA
torch_xla version: tag: v2.5.0-rc1
GPU type: V100
GCP info: IMAGE_FAMILY=pytorch-1-12-cu113
GCP info: COUNT=4
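For follow-up comments, part of this environment information can be gathered from inside the container with the standard library alone. This is a minimal sketch: `installed_version` and `collect_environment` are hypothetical helper names, and it assumes the packages are installed under the distribution names `torch` and `torch_xla`. It does not query the GPU type or the GCP settings.

```python
import os
import platform
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def collect_environment():
    """Collect interpreter, package, and PJRT-related settings into a dict."""
    return {
        "python": platform.python_version(),
        "torch": installed_version("torch"),          # assumed distribution name
        "torch_xla": installed_version("torch_xla"),  # assumed distribution name
        "PJRT_DEVICE": os.environ.get("PJRT_DEVICE"),
        "GPU_NUM_DEVICES": os.environ.get("GPU_NUM_DEVICES"),
    }
```

Printing the result of `collect_environment()` gives a compact summary that can be pasted next to the fields above; entries are `None` when a package or variable is not present.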

Sep 27 '24 23:09 ManfeiBai
