[2.5 release] GPU docker image failed to run mnist test

Open • ManfeiBai opened this issue 4 months ago • 2 comments

🐛 Bug

A newly built GPU Docker image for PyTorch/XLA 2.5 (built from the r2.5 branch) passes import torch_xla and passes PJRT_DEVICE=CPU python test/test_train_mp_mnist.py, but the MNIST test fails with PJRT_DEVICE=CUDA. Full logs: https://gist.github.com/ManfeiBai/f9efab9ce534970b7d9537006ff50a1a

  • 8 GPU:

    • cmd: GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
    • error: Failed to shut down the distributed runtime client. torch_xla/csrc/runtime/xla_coordinator.cc:48 : Check failed: dist_runtime_client_->Shutdown().ok()
  • 1 GPU:

    • cmd: GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
    • error: RuntimeError: Bad StatusOr access: FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.

To Reproduce

Steps to reproduce the behavior:

  1. Get a machine with a GPU.
  2. Create a new Docker container from the GPU test image us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc1_3.9_cuda_12.1:
    • cmd: sudo docker run --shm-size=16G --gpus all --name netnenewnewnewr25py39 --network host -it -d us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0-rc1_3.9_cuda_12.1 bin/bash
  3. Clone the PyTorch/XLA repo (r2.5 branch):
    • cmd: git clone -b r2.5 https://github.com/pytorch/xla.git
  4. Change into the PyTorch/XLA repo directory:
    • cmd: cd xla
  5. Run the MNIST test with PJRT_DEVICE=CUDA, using any of the following:
    • cmd: GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
    • cmd: GPU_NUM_DEVICES=8 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py
    • cmd: GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_train_mp_mnist.py --num_epochs 2
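For convenience, the three command variants in the last step could be driven from one small script. This is only a sketch, not part of the original report: the helper names (build_env, build_command, describe, run_all) and the --run flag are made up for illustration, and it assumes the script is started from the root of the xla checkout inside the container. By default it only prints the command lines; pass --run to execute them.

```python
import os
import subprocess
import sys

# The three configurations from the reproduction steps:
# (GPU_NUM_DEVICES, extra command-line arguments).
CASES = [
    ("1", []),
    ("8", []),
    ("1", ["--num_epochs", "2"]),
]

TEST_SCRIPT = "test/test_train_mp_mnist.py"  # relative to the xla checkout


def build_env(num_devices, base_env=None):
    """Return a copy of the environment with the PJRT CUDA settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env["PJRT_DEVICE"] = "CUDA"
    env["GPU_NUM_DEVICES"] = str(num_devices)
    return env


def build_command(extra_args):
    """Return the argv list for one run of the MNIST test."""
    return [sys.executable, TEST_SCRIPT, *extra_args]


def describe(num_devices, extra_args):
    """Render a case as the shell command line used in this report."""
    parts = [
        f"GPU_NUM_DEVICES={num_devices}",
        "PJRT_DEVICE=CUDA",
        "python",
        TEST_SCRIPT,
        *extra_args,
    ]
    return " ".join(parts)


def run_all():
    """Run every case in sequence and return {command line: exit code}."""
    results = {}
    for num_devices, extra_args in CASES:
        completed = subprocess.run(
            build_command(extra_args), env=build_env(num_devices)
        )
        results[describe(num_devices, extra_args)] = completed.returncode
    return results


if __name__ == "__main__":
    if "--run" in sys.argv[1:]:
        for line, code in run_all().items():
            print(f"exit {code}: {line}")
    else:
        # Dry run by default: only print what would be executed.
        for num_devices, extra_args in CASES:
            print(describe(num_devices, extra_args))
```

A non-zero exit code for a case means that configuration failed; the error text itself still has to be read from the test output.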

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: CUDA
  • torch_xla version: tag: v2.5.0-rc1
  • GPU type: V100
  • GCP info: IMAGE_FAMILY=pytorch-1-12-cu113
  • GCP info: COUNT=4
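To confirm which versions are actually installed inside the container, a short check along these lines might help. It is a sketch with hypothetical helper names (module_version, collect_versions); it relies only on the __version__ attribute that torch and torch_xla expose, and reports None for any module that cannot be imported.

```python
import importlib
import platform


def module_version(name):
    """Return a module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except Exception:
        return None
    return getattr(module, "__version__", None)


def collect_versions():
    """Gather the interpreter and package versions relevant to this report."""
    return {
        "python": platform.python_version(),
        "torch": module_version("torch"),
        "torch_xla": module_version("torch_xla"),
    }


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```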

ManfeiBai • Sep 27 '24 23:09