Using `accelerate launch` to initialize a SageMaker job doesn't work properly with multiple GPUs
System Info
- `Accelerate` version: 0.30.1
- Platform: Linux-6.8.0-1015-aws-x86_64-with-glibc2.31
- `accelerate` bash location: /opt/conda/bin/accelerate
- Python version: 3.11.9
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 30.98 GB
- `Accelerate` default config:
Not found
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate, or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
Please note that the system info above does not reflect the actual environment Accelerate runs in on SageMaker; it was generated in an official SageMaker container.
To reproduce the bug:
- Create any training script that invokes `accelerator.gather()` (a minimal sketch is included after this list)
- Configure accelerate to run on a SageMaker multi-GPU machine using `accelerate config`, with 209479262201.dkr.ecr.us-west-2.amazonaws.com/1xgpt-from-sagemaker:2.3.0 as the Docker image
- Create a training job using `accelerate launch` and run the training script
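For concreteness, here is a minimal sketch of such a training script (the model, data, and file name are placeholders rather than the actual code used; the only relevant part is the gather call at the end of the loop):

```python
# minimal_gather_repro.py -- placeholder model/data; only the gather call matters here.
import torch
from accelerate import Accelerator


def main():
    accelerator = Accelerator()
    model = torch.nn.Linear(8, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    dataloader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,))),
        batch_size=8,
    )
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for inputs, labels in dataloader:
        logits = model(inputs)
        loss = torch.nn.functional.cross_entropy(logits, labels)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        # This is the call that fails: under SMDDP it dispatches to
        # all_gather_into_tensor, which raises "SMDDP does not support: _allgather_base".
        preds = accelerator.gather_for_metrics(logits.argmax(dim=-1))
        accelerator.print(preds.shape)


if __name__ == "__main__":
    main()
```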
Expected behavior
SageMaker returns an error along the lines of the following:
File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2373, in gather_for_metrics
data = self.gather(input_data)
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2329, in gather
return gather(tensor)
^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 380, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 441, in gather
return _gpu_gather(tensor)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 360, in _gpu_gather
return recursively_apply(_gpu_gather_one, tensor, error_on_other_type=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 126, in recursively_apply
return func(data, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/accelerate/utils/operations.py", line 350, in _gpu_gather_one
gather_op(output_tensors, tensor)
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2948, in all_gather_into_tensor
work = group._allgather_base(output_tensor, input_tensor, opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: SMDDP does not support: _allgather_base
If `accelerate launch` is invoked inside the SageMaker job instead of being used to create it, the script works fine. I suspect this is because MPI is not well supported by SageMaker, yet `accelerate launch` uses MPI.
Yes, I'd recommend invoking it inside of SageMaker instead in this case. (Though MPI should only be run on CPU, not GPU.)
Sorry if I wasn't clear in my original report. This is more of a complaint about the default behavior of `accelerate launch` when configured to run on SageMaker. When I followed this guide to configure and run accelerate with SageMaker, it defaulted to MPI, which doesn't work with distributed training on SageMaker. `accelerate launch` should default to NCCL when configured to run distributed training on SageMaker.
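For anyone hitting this, a quick way to confirm which backend the job actually came up with is a small diagnostic like the sketch below (a hypothetical helper, not part of the original report), run at the top of the training script:

```python
# check_backend.py -- hypothetical diagnostic to confirm which torch.distributed
# backend the SageMaker job actually initialized.
import torch.distributed as dist
from accelerate import PartialState

state = PartialState()  # sets up the process group the same way Accelerator() would
if dist.is_available() and dist.is_initialized():
    # On a correctly configured multi-GPU job this should report "nccl";
    # the MPI/SMDDP path described above reports something else (or fails earlier).
    print(f"rank {state.process_index}/{state.num_processes}: backend = {dist.get_backend()}")
else:
    print("torch.distributed was not initialized")
```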
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @BaldPulse
I'm facing the same issue when launching SageMaker training jobs using accelerate, and also getting RuntimeError: SMDDP does not support: _allgather_base.
Can you explain what you did?
@eyal-converge I presume you followed the instructions here and used accelerate to launch a SageMaker training job. The solution is, instead of launching the job with accelerate, to create the job first and invoke accelerate in the entry script.
You can check out this file and the repo for an example.
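Roughly, the setup looks like the sketch below (assuming the sagemaker Python SDK; the role, instance type, versions, and file names here are placeholders, not the exact values from the linked repo):

```python
# launch_job.py -- run locally or from a notebook; it creates the SageMaker job first,
# and the entry script then calls `accelerate launch` inside the container.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="entry.sh",          # placeholder: a script that runs
                                     #   accelerate launch --multi_gpu --num_processes "$SM_NUM_GPUS" train.py
    source_dir=".",                  # directory containing entry.sh, train.py, requirements.txt
    role="arn:aws:iam::<account>:role/<sagemaker-execution-role>",  # placeholder role
    instance_type="ml.g5.12xlarge",  # any multi-GPU instance
    instance_count=1,
    framework_version="2.3.0",       # illustrative; match your container's PyTorch version
    py_version="py311",
)
estimator.fit()
```

If a shell entry point isn't convenient, a thin Python wrapper that shells out to `accelerate launch` via subprocess works the same way.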
This is INSANE.. how did you find this? I'll dig deeper into the example you sent - many thanks
I guess I was too naive following the official docs and assuming it works..