DeepSpeed ZeRO3 errors on config initialization
System Info
transformers-cli env:
- transformers version: 4.37.2
- Platform: Linux-6.2.0-1017-aws-x86_64-with-glibc2.31
- Python version: 3.9.18
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: 0.26.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): 2.11.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.7.2 (cpu)
- Jax version: 0.4.13
- JaxLib version: 0.4.13
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Relevant Dependencies:
accelerate==0.26.1
deepspeed==0.12.3
ray==2.9.1
transformers==4.37.2
Who can help?
@pacman100 @muellerzr
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I'm running the following script on a g4dn.12xlarge instance.
import torch.distributed
from transformers import AutoModel, TrainingArguments
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    assert torch.distributed.is_initialized(), "Torch Distributed must be initialized."

    deepspeed_config = {
        "zero_optimization": {
            "stage": 3,
        },
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }
    train_args = TrainingArguments(
        output_dir="./",
        deepspeed=deepspeed_config,
    )
    model = AutoModel.from_pretrained("bert-base-uncased")


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=True,
    )
)
trainer.fit()
This errors with:
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 118, in discard_return_wrapper
train_func(*args, **kwargs)
File "/home/ray/default/simple.py", line 22, in train_func
model = AutoModel.from_pretrained("bert-base-uncased")
File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3583, in from_pretrained
init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 859, in __init__
_ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 781, in __init__
self._configure_train_batch_size()
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 959, in _configure_train_batch_size
self._batch_assertion()
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 907, in _batch_assertion
assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 16 != 8 * 1 * 1
I did some debugging, and it seems like world_size is being set to 1 because dist is not initialized yet here.
I also did some bisection and saw that the error started occurring in transformers==4.30.0.
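For reference, a minimal probe along these lines (untested) should show the mismatch when run at the top of train_func:

# Probe sketch: under Ray Train (or torchrun), the Torch process group is
# already initialized, but DeepSpeed's comm layer is not, which is why
# DeepSpeedConfig falls back to world_size == 1 in its batch-size checks.
import torch.distributed
import deepspeed.comm

print("torch.distributed initialized:", torch.distributed.is_initialized())  # expected: True
print("torch.distributed world size: ", torch.distributed.get_world_size())  # expected: 2
print("deepspeed.comm initialized:   ", deepspeed.comm.is_initialized())     # expected: False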
Related Issues:
- https://github.com/microsoft/DeepSpeed/issues/3341 - this seems to be the exact same issue, but I haven't dug deep enough to tell whether the problem lies in DeepSpeed, Transformers, or Accelerate.
Expected behavior
The script should run without error, and the DeepSpeed distributed environment should be inherited from the existing Torch process group.
The issue does not occur if I use ZeRO2:
 "zero_optimization": {
-    "stage": 3,
+    "stage": 2,
 },
The issue can also be mitigated by manually initializing the DeepSpeed distributed environment with deepspeed.init_distributed().
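For illustration, a sketch of that mitigation applied to the repro above (hypothetical placement, untested as written):

# Sketch of the mitigation: initialize DeepSpeed's distributed backend explicitly
# before the model is loaded, so deepspeed.zero.Init sees the real world size
# instead of defaulting to 1.
import deepspeed
import torch.distributed
from transformers import AutoModel, TrainingArguments


def train_func():
    assert torch.distributed.is_initialized(), "Torch Distributed must be initialized."

    # DeepSpeed attaches to the process group that Ray Train already created.
    deepspeed.init_distributed()

    deepspeed_config = {
        "zero_optimization": {"stage": 3},
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }
    train_args = TrainingArguments(output_dir="./", deepspeed=deepspeed_config)
    model = AutoModel.from_pretrained("bert-base-uncased")  # no longer hits the batch assertion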
cc @pacman100 and @SunMarc
Hello @pacman100 @SunMarc, could you review this issue? Thanks so much!
Thank you @matthewdeng for raising the issue. I am unfamiliar with ray, but I'm looking into this.
Oops, sorry for including that part. The same behavior can be seen with torchrun.
script.py:
import torch.distributed
from transformers import AutoModel, TrainingArguments

torch.distributed.init_process_group(backend="nccl")

deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
train_args = TrainingArguments(
    output_dir="./",
    deepspeed=deepspeed_config,
)
model = AutoModel.from_pretrained("bert-base-uncased")
Command:
torchrun --standalone --nnodes=1 --nproc-per-node=2 script.py
Output:
Traceback (most recent call last):
File "/home/ray/default/script.py", line 20, in <module>
model = AutoModel.from_pretrained("bert-base-uncased")
File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
return model_class.from_pretrained(
File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3583, in from_pretrained
init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 859, in __init__
_ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 778, in __init__
self._configure_train_batch_size()
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 956, in _configure_train_batch_size
self._batch_assertion()
File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 904, in _batch_assertion
assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 16 != 8 * 1 * 1
@pacman100 gentle bump on this!
Hello @pacman100, upon investigating, I think this issue stems from the accelerate library skipping the initialization of the DeepSpeed backend when a PyTorch distributed environment is detected.
Here is the relevant code: https://github.com/huggingface/accelerate/blob/v0.25.0/src/accelerate/state.py#L171
# This condition will be False, so the DeepSpeed backend will not be initialized.
if not torch.distributed.is_initialized():
    from deepspeed import comm as dist

    # DeepSpeed always uses nccl
    kwargs.pop("backend", None)
    if is_xpu_available and is_ccl_available():
        # Set DeepSpeed backend to ccl for xpu
        self.backend = "ccl"
    elif is_npu_available():
        self.backend = "hccl"
    else:
        self.backend = "nccl"
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
I have also posted a detailed analysis at https://github.com/ray-project/ray/issues/44204. Thank you!
While I haven't yet conducted extensive testing, it might be worth substituting deepspeed.comm.is_initialized() for torch.distributed.is_initialized() as a potential fix. Maybe I can test this and see whether it works without other side effects.
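To make that concrete, here is a rough, hypothetical sketch of the gating logic (not the actual accelerate code, and untested):

# Hypothetical sketch: gate the DeepSpeed backend setup on DeepSpeed's own comm
# state instead of torch.distributed, so an already-initialized Torch process
# group no longer causes the DeepSpeed backend to be skipped.
from deepspeed import comm as dist


def maybe_init_deepspeed_backend(backend: str = "nccl", **kwargs) -> None:
    if not dist.is_initialized():
        # Same call accelerate makes today, just behind a different check.
        dist.init_distributed(dist_backend=backend, auto_mpi_discovery=False, **kwargs)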
Hello @sword865,
I looked at this issue with the simplified repro example given by @matthewdeng. Yes, your investigation is correct, as is the suggestion to replace the torch.distributed.is_initialized() check with deepspeed.comm.is_initialized(), which is available in DeepSpeed 0.9.3, the minimum version supported by Accelerate. It would be great if you could raise a PR with your suggested fix! Thank you!
Thank you, @pacman100. I have created a pull request with the fix. Could you please assist with the review?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing this issue since it is solved!