Add XLA unit tests to pre-submit CI
System Info
Hi team,
Can we add some unit tests for XLA (GPU, TPU v4, TPU v2 or v3) to the pre-submit CI? The test can be as simple as `accelerate test`. The reason for the request is that we have observed a few recent changes in accelerate that broke `accelerate test` on TPU, such as https://github.com/huggingface/accelerate/pull/2319 and https://github.com/huggingface/accelerate/pull/2176. It takes longer for the PyTorch/XLA team to fix them because the PyTorch/XLA team is not familiar with the change, and it would be great if the PR author could fix the issue before the PR is merged, since they have the most context, so that users won't see the regression. Thanks!
cc @will-cromar, @JackCaoG, @muellerzr
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
Reproduction
config:
compute_environment: LOCAL_MACHINE
distributed_type: XLA
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
export PJRT_DEVICE=TPU
accelerate test
Expected behavior
na
The issue is that we don't run runners on pre-submit CI, nor do we have any TPUs to run on the merge CI, so we have no way of testing TPUs ourselves with accelerate outside of running it manually in Colab.
(We could maybe look at adding GPU XLA tests post-submit, though.)
Note: we also don't run GPU runners on pre-submit; only the main CI and the nightlies have access to those.
Thanks for the response. In that case, can we add GPU XLA tests post-submit? That would help catch issues earlier.
Certainly.
IIUC all that's needed to get this going is to install torch_xla alongside the current state of accelerate, correct? If so, then we just need to set up a new xla-gpu-specific Docker image and run the tests on a CI runner similar to what we have now.
> Certainly.
>
> IIUC all that's needed to get this going is to install torch_xla alongside the current state of accelerate, correct? If so, then we just need to set up a new xla-gpu-specific Docker image and run the tests on a CI runner similar to what we have now.
That's correct. Thanks!
@vanbasten23 do you have a good "hello world" test that can be run on the GPU Docker images to check and see if everything works okay? I'm hitting a few snags just doing `accelerate test`, and I can't seem to get things working despite setting `PJRT_DEVICE=CUDA`.
Dockerfile I'm testing:
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.2.0_3.10_cuda_12.1
RUN python3 -m pip install --no-cache-dir \
git+https://github.com/huggingface/accelerate#egg=accelerate[test_prod,test_integrations] \
--extra-index-url https://download.pytorch.org/whl/cu117
# Activate the virtualenv
CMD ["/bin/bash"]
Yes. You can use this (run it with `PJRT_DEVICE=CUDA python`):
import torch, torch_xla
import torch_xla.core.xla_model as xm
t1 = torch.randn(1, 128, device='cpu')
t2 = torch.randn(1, 128, device='cpu')
xt1 = t1.to(xm.xla_device())
xt2 = t2.to(xm.xla_device())
expected = t1 + t2
actual = (xt1 + xt2).cpu()
assert torch.allclose(expected, actual)
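If that runs cleanly under `PJRT_DEVICE=CUDA`, the CUDA PJRT device itself is usable (device transfer plus a small computation), and any remaining `accelerate test` failure is more likely in the multi-process/collective path.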
BTW, just noticing this: we should eventually change the logic so PJRT_DEVICE is auto-set if multi-GPU is enabled through the config file and torch_xla is available.
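Roughly something like this (just a sketch of the idea with a hypothetical helper name, not accelerate's actual logic):

```python
import importlib.util
import os


def maybe_default_pjrt_device(multi_gpu: bool) -> None:
    # Hypothetical helper: if multi-GPU is enabled in the accelerate config
    # and torch_xla is importable, default PJRT_DEVICE to CUDA instead of
    # requiring the user to export it by hand.
    if multi_gpu and importlib.util.find_spec("torch_xla") is not None:
        # setdefault keeps any value the user has already exported.
        os.environ.setdefault("PJRT_DEVICE", "CUDA")
```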
Running `PJRT_DEVICE=CUDA accelerate test` eventually leaves me with this trace:
2024-03-13 17:12:13.441463: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.
2024-03-13 17:12:13.441551: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2716] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
2024-03-13 17:12:13.441641: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.
2024-03-13 17:12:13.441712: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2716] Execution of replica 1 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
Traceback (most recent call last):
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 716, in <module>
Traceback (most recent call last):
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 716, in <module>
main()
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 664, in main
main()
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 664, in main
state.wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 934, in wait_for_everyone
state.wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 934, in wait_for_everyone
PartialState().wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 423, in wait_for_everyone
xm.rendezvous("accelerate.utils.wait_for_everyone")
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1166, in rendezvous
PartialState().wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 423, in wait_for_everyone
xm.rendezvous("accelerate.utils.wait_for_everyone")
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1166, in rendezvous
return xla_rendezvous(payload, replicas or None, tag=tag)
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1133, in xla_rendezvous
return xla_rendezvous(payload, replicas or None, tag=tag)
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1133, in xla_rendezvous
if max_size.item() < 1:
RuntimeError: Bad StatusOr access: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
Any clue what's going on there? It should certainly not be running out of memory with 2x 24 GB GPUs, and I set `--shm-size="48gb"`.
If we can get to a point where I can run them locally via Docker and things make sense on a CUDA runtime, then we can integrate it into a CI.
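A standalone multi-process all-reduce could also help narrow down whether NCCL init itself is the problem, independent of accelerate (rough sketch, assuming torch_xla's `xmp.spawn`/`xm.all_reduce` API; the file name is hypothetical):

```python
# nccl_smoke.py -- hypothetical standalone check, not part of accelerate
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    device = xm.xla_device()
    t = torch.ones(128, device=device)
    # All-reduce across all spawned processes; this exercises the same
    # NCCL collective path that the rendezvous in the trace above hits.
    xm.all_reduce(xm.REDUCE_SUM, [t])
    xm.mark_step()
    print(f"rank {index}: sum element = {t[0].item()}")


if __name__ == "__main__":
    xmp.spawn(_mp_fn)
```

Run as `PJRT_DEVICE=CUDA python nccl_smoke.py`: if it hits the same `ncclCommInitRank ... out of memory` failure, the problem is at the torch_xla/NCCL level rather than in accelerate.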
> BTW, just noticing this: we should eventually change the logic so PJRT_DEVICE is auto-set if multi-GPU is enabled through the config file and torch_xla is available.

Completely agreed.

> Any clue what's going on there?

Do you know your CUDA runtime version (`nvcc --version`)? I'm using CUDA 12.1, and I got a different error; it looks like it accessed the XLA devices before calling the spawn.
I rebased my codebase to get the latest code on the main branch, and here is the new error I got. It fails at https://github.com/huggingface/accelerate/blob/2ad42e77c3a1993dbfb9bc299c21bae2005c0572/src/accelerate/test_utils/scripts/test_script.py#L751. It looks like it failed at a later place in test_script.py than the one in https://github.com/huggingface/accelerate/issues/2545#issuecomment-1995056948.
Yes, that's the random sampler part I mentioned could be bad in this PR 😉 https://github.com/huggingface/accelerate/pull/2542#discussion_r1520270882
I reverted the change locally in https://github.com/huggingface/accelerate/pull/2542/files#diff-d9858283a2ced902233727f6fddde0a00831ad9a66a069e57231a5057d550bf6 and I still got the same error.
Hmm okay, I'll try giving it a look tomorrow.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.