Add XLA unit tests to pre-submit CI
System Info
Hi team,
Can we add some unit tests for XLA (GPU, TPU v4, TPU v2 or v3) to the pre-submit CI? The test can be as simple as `accelerate test`. The reason for the request is that we have observed a few recent changes in accelerate that broke `accelerate test` on TPU, such as https://github.com/huggingface/accelerate/pull/2319 and https://github.com/huggingface/accelerate/pull/2176. It takes longer for the PyTorch/XLA team to fix them because the PyTorch/XLA team is not familiar with the change, and it would be great if the PR author could fix the issue before the PR is merged, since they have the most context, so that users won't see the regression. Thanks!
cc @will-cromar, @JackCaoG, @muellerzr
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
Reproduction
config:
compute_environment: LOCAL_MACHINE
distributed_type: XLA
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
export PJRT_DEVICE=TPU
accelerate test
Expected behavior
na
The issue is that we don't run runners on pre-submit CI, nor do we have any TPUs to run on the merge CI, so we have no way of testing TPUs ourselves with accelerate outside of running it manually in Colab.
(We could maybe look at adding GPU XLA tests post-submit, though.)
Note: we also don't run GPU runners on pre-submit; only the main CI and the nightlies have access to those.
Thanks for the response. In that case, can we add GPU XLA tests post-submit? That would help catch issues earlier.
Certainly.
IIUC all that's needed to get this going is to install torch_xla alongside the current state of accelerate, correct? If so, then we just need to set up a new xla-gpu-specific Docker image and run the tests on a CI runner similar to what we have now.
> Certainly.
>
> IIUC all that's needed to get this going is to install torch_xla alongside the current state of accelerate, correct? If so, then we just need to set up a new xla-gpu-specific Docker image and run the tests on a CI runner similar to what we have now.
That's correct. Thanks!
@vanbasten23 do you have a good "hello world" test that can be run on the GPU Docker images to check and see if everything works okay? I'm hitting a few snags just doing `accelerate test`, and I can't seem to get things working despite setting `PJRT_DEVICE=CUDA`.
Dockerfile I'm testing:
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.2.0_3.10_cuda_12.1
RUN python3 -m pip install --no-cache-dir \
git+https://github.com/huggingface/accelerate#egg=accelerate[test_prod,test_integrations] \
--extra-index-url https://download.pytorch.org/whl/cu117
# Activate the virtualenv
CMD ["/bin/bash"]
Yes. You can use this (run it with `PJRT_DEVICE=CUDA python`):
import torch, torch_xla
import torch_xla.core.xla_model as xm
t1 = torch.randn(1, 128, device='cpu')
t2 = torch.randn(1, 128, device='cpu')
xt1 = t1.to(xm.xla_device())
xt2 = t2.to(xm.xla_device())
expected = t1 + t2
actual = (xt1 + xt2).cpu()
assert torch.allclose(expected, actual)
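If that runs cleanly under `PJRT_DEVICE=CUDA`, the CUDA PJRT device itself is usable (device transfer plus a small computation), and any remaining `accelerate test` failure is more likely in the multi-process/collective path.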
BTW, just noticing this: we should eventually change the logic so PJRT_DEVICE is auto-set if multi-GPU is enabled through the config file and torch_xla is available.
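Roughly something like this (just a sketch of the idea with a hypothetical helper name, not accelerate's actual logic):

```python
import importlib.util
import os


def maybe_default_pjrt_device(multi_gpu: bool) -> None:
    # Hypothetical helper: if multi-GPU is enabled in the accelerate config
    # and torch_xla is importable, default PJRT_DEVICE to CUDA instead of
    # requiring the user to export it by hand.
    if multi_gpu and importlib.util.find_spec("torch_xla") is not None:
        # setdefault keeps any value the user has already exported.
        os.environ.setdefault("PJRT_DEVICE", "CUDA")
```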
Running `PJRT_DEVICE=CUDA accelerate test` eventually leaves me with this trace:
2024-03-13 17:12:13.441463: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.
2024-03-13 17:12:13.441551: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2716] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
2024-03-13 17:12:13.441641: W external/xla/xla/service/gpu/runtime/support.cc:58] Intercepted XLA runtime error:
INTERNAL: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.
2024-03-13 17:12:13.441712: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2716] Execution of replica 1 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
Traceback (most recent call last):
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 716, in <module>
Traceback (most recent call last):
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 716, in <module>
main()
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 664, in main
main()
File "/mnt/accelerate/src/accelerate/test_utils/scripts/test_script.py", line 664, in main
state.wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 934, in wait_for_everyone
state.wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 934, in wait_for_everyone
PartialState().wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 423, in wait_for_everyone
xm.rendezvous("accelerate.utils.wait_for_everyone")
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1166, in rendezvous
PartialState().wait_for_everyone()
File "/mnt/accelerate/src/accelerate/state.py", line 423, in wait_for_everyone
xm.rendezvous("accelerate.utils.wait_for_everyone")
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1166, in rendezvous
return xla_rendezvous(payload, replicas or None, tag=tag)
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1133, in xla_rendezvous
return xla_rendezvous(payload, replicas or None, tag=tag)
File "/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py", line 1133, in xla_rendezvous
if max_size.item() < 1:
RuntimeError: Bad StatusOr access: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.all_reduce' failed: external/xla/xla/service/gpu/nccl_utils.cc:305: NCCL operation ncclCommInitRank(&comm, nranks, id, rank) failed: unhandled cuda error (run with NCCL_DEBUG=INFO for details). Last NCCL warning(error) log entry (may be unrelated) 'Cuda failure 'out of memory''.; current tracing scope: all-reduce-start; current profiling annotation: XlaModule:#hlo_module=SyncTensorsGraph.25,program_id=0#.
Any clue what's going on there? It should certainly not be running out of memory with 2x 24 GB GPUs, and I set `--shm-size="48gb"`.
If we can get to a point where I can run them locally via Docker and things make sense on a CUDA runtime, then we can integrate it into a CI.
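A standalone multi-process all-reduce could also help narrow down whether NCCL init itself is the problem, independent of accelerate (rough sketch, assuming torch_xla's `xmp.spawn`/`xm.all_reduce` API; the file name is hypothetical):

```python
# nccl_smoke.py -- hypothetical standalone check, not part of accelerate
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    device = xm.xla_device()
    t = torch.ones(128, device=device)
    # All-reduce across all spawned processes; this exercises the same
    # NCCL collective path that the rendezvous in the trace above hits.
    xm.all_reduce(xm.REDUCE_SUM, [t])
    xm.mark_step()
    print(f"rank {index}: sum element = {t[0].item()}")


if __name__ == "__main__":
    xmp.spawn(_mp_fn)
```

Run as `PJRT_DEVICE=CUDA python nccl_smoke.py`: if it hits the same `ncclCommInitRank ... out of memory` failure, the problem is at the torch_xla/NCCL level rather than in accelerate.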
> BTW, just noticing this: we should eventually change the logic so PJRT_DEVICE is auto-set if multi-GPU is enabled through the config file and torch_xla is available.

Completely agreed.

> Any clue what's going on there?

Do you know your CUDA runtime version (`nvcc --version`)? I'm using CUDA 12.1, and I got a different error; it looks like it accessed the XLA devices before calling the spawn.
I rebased my codebase to get the latest code on the main branch, and here is the new error I got. It fails at https://github.com/huggingface/accelerate/blob/2ad42e77c3a1993dbfb9bc299c21bae2005c0572/src/accelerate/test_utils/scripts/test_script.py#L751. It looks like it failed at a later place in test_script.py than the one in https://github.com/huggingface/accelerate/issues/2545#issuecomment-1995056948.
Yes, that's the random sampler part I mentioned could be bad in this PR 😉 https://github.com/huggingface/accelerate/pull/2542#discussion_r1520270882
I reverted the change locally in https://github.com/huggingface/accelerate/pull/2542/files#diff-d9858283a2ced902233727f6fddde0a00831ad9a66a069e57231a5057d550bf6 and I still got the same error.
Hmm okay, I'll try giving it a look tomorrow.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.