
DeepSpeed ZeRO3 errors on config initialization

Open matthewdeng opened this issue 1 year ago • 9 comments

System Info

transformers-cli env:

  • transformers version: 4.37.2
  • Platform: Linux-6.2.0-1017-aws-x86_64-with-glibc2.31
  • Python version: 3.9.18
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.1
  • Accelerate version: 0.26.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Tensorflow version (GPU?): 2.11.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.7.2 (cpu)
  • Jax version: 0.4.13
  • JaxLib version: 0.4.13
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Relevant Dependencies:

accelerate==0.26.1
deepspeed==0.12.3
ray==2.9.1
transformers==4.37.2

Who can help?

@pacman100 @muellerzr

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I'm running the following script on a g4dn.12xlarge instance.

import torch.distributed
from transformers import AutoModel, TrainingArguments
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    assert torch.distributed.is_initialized(), "Torch Distributed must be initialized."

    deepspeed_config = {
        "zero_optimization": {
            "stage": 3,
        },
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }

    train_args = TrainingArguments(
        output_dir="./",
        deepspeed=deepspeed_config,
    )

    model = AutoModel.from_pretrained("bert-base-uncased")

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(
        num_workers=2,
        use_gpu=True,
    )
)
trainer.fit()

This errors with:

  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 118, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/home/ray/default/simple.py", line 22, in train_func
    model = AutoModel.from_pretrained("bert-base-uncased")
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3583, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 859, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 781, in __init__
    self._configure_train_batch_size()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 959, in _configure_train_batch_size
    self._batch_assertion()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 907, in _batch_assertion
    assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 16 != 8 * 1 * 1

I did some debugging, and it seems like world_size is being set to 1 because dist is not yet initialized at that point, so DeepSpeed computes 8 * 1 * 1 = 8 on the right-hand side of the assertion instead of 8 * 1 * 2 = 16.

I also did some bisection and found that the error started occurring in transformers==4.30.0.

Related Issues:

  • https://github.com/microsoft/DeepSpeed/issues/3341 - this seems to be the exact same issue, but I haven't looked deeply enough to determine whether the root cause lies in DeepSpeed, Transformers, or Accelerate.

Expected behavior

The script should run without error, and the DeepSpeed distributed environment should be inherited from the existing Torch process group.

The issue does not occur if I use ZeRO2:

        "zero_optimization": {
-            "stage": 3,
+            "stage": 2,
        },

The issue can also be mitigated by manually initializing the DeepSpeed distributed environment with deepspeed.init_distributed().
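
For illustration, a minimal sketch of that mitigation applied to the train_func from the reproduction above (the explicit deepspeed.init_distributed() call is the only change; the nccl backend is assumed):

import deepspeed
import torch.distributed
from transformers import AutoModel, TrainingArguments


def train_func():
    assert torch.distributed.is_initialized(), "Torch Distributed must be initialized."

    # Workaround: explicitly initialize DeepSpeed's communication backend so it
    # attaches to the existing torch.distributed process group and reports the
    # real world size instead of 1.
    deepspeed.init_distributed(dist_backend="nccl")

    deepspeed_config = {
        "zero_optimization": {"stage": 3},
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }

    train_args = TrainingArguments(
        output_dir="./",
        deepspeed=deepspeed_config,
    )

    # zero.Init inside from_pretrained now sees the correct world size.
    model = AutoModel.from_pretrained("bert-base-uncased")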

matthewdeng avatar Feb 01 '24 00:02 matthewdeng

cc @pacman100 and @SunMarc

ArthurZucker avatar Feb 01 '24 13:02 ArthurZucker

Hello @pacman100 @SunMarc, could you review this issue? Thanks so much!

matthewdeng avatar Feb 26 '24 18:02 matthewdeng

Thank you @matthewdeng for raising the issue. I am unfamiliar with Ray, but I'm looking into this.

pacman100 avatar Feb 27 '24 05:02 pacman100

Oops, sorry for including that part. The same behavior can be seen with torchrun.

script.py:

import torch.distributed
from transformers import AutoModel, TrainingArguments

torch.distributed.init_process_group(backend="nccl")

deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

train_args = TrainingArguments(
    output_dir="./",
    deepspeed=deepspeed_config,
)

model = AutoModel.from_pretrained("bert-base-uncased")

Command:

torchrun --standalone --nnodes=1 --nproc-per-node=2 script.py

Output:

Traceback (most recent call last):
  File "/home/ray/default/script.py", line 20, in <module>
    model = AutoModel.from_pretrained("bert-base-uncased")
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3583, in from_pretrained
    init_contexts = [deepspeed.zero.Init(config_dict_or_path=deepspeed_config())] + init_contexts
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 859, in __init__
    _ds_config = deepspeed.runtime.config.DeepSpeedConfig(config_dict_or_path,
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 778, in __init__
    self._configure_train_batch_size()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 956, in _configure_train_batch_size
    self._batch_assertion()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/config.py", line 904, in _batch_assertion
    assert train_batch == micro_batch * grad_acc * self.world_size, (
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 16 != 8 * 1 * 1

matthewdeng avatar Feb 28 '24 04:02 matthewdeng

@pacman100 gentle bump on this!

matthewdeng avatar Mar 12 '24 20:03 matthewdeng

Hello @pacman100, upon investigating, I think this issue stems from the accelerate library skipping the initialization of the DeepSpeed backend when a PyTorch distributed environment is already initialized.

Here is the relevant code, from https://github.com/huggingface/accelerate/blob/v0.25.0/src/accelerate/state.py#L171:

# When torch.distributed is already initialized, this condition is False and
# the DeepSpeed backend never gets initialized.
if not torch.distributed.is_initialized():
    from deepspeed import comm as dist

    # DeepSpeed always uses nccl
    kwargs.pop("backend", None)
    if is_xpu_available and is_ccl_available():
        # Set DeepSpeed backend to ccl for xpu
        self.backend = "ccl"
    elif is_npu_available():
        self.backend = "hccl"
    else:
        self.backend = "nccl"
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
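
For reference, the state mismatch this check skips over can be demonstrated with a small standalone script (a sketch only; run under torchrun with 2 GPU processes, and it assumes a DeepSpeed version where deepspeed.comm.is_initialized() can be queried before initialization):

# Demonstrates that torch.distributed can already be initialized while
# DeepSpeed's own comm layer is not, which is why the check above skips
# dist.init_distributed() and leaves DeepSpeed's world size at 1.
import torch.distributed
from deepspeed import comm as ds_comm

torch.distributed.init_process_group(backend="nccl")

print("torch.distributed initialized:", torch.distributed.is_initialized())  # True
print("deepspeed.comm initialized:", ds_comm.is_initialized())  # expected False here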

I have a more detailed analysis at https://github.com/ray-project/ray/issues/44204. Thank you!

sword865 avatar Mar 21 '24 08:03 sword865

While I haven't yet conducted extensive testing, it might be worth substituting deepspeed.comm.is_initialized() in place of torch.distributed.is_initialized() as a potential fix. I can test this and see whether it works without other side effects.
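
As a rough standalone sketch of that substitution (a hypothetical helper, not the actual accelerate patch; the function name is made up):

from deepspeed import comm as ds_comm


def ensure_deepspeed_backend(backend: str = "nccl") -> None:
    # Check DeepSpeed's own comm state instead of torch.distributed's, so the
    # DeepSpeed backend still gets initialized (attaching to an existing
    # torch.distributed process group if there is one) rather than being skipped.
    if not ds_comm.is_initialized():
        ds_comm.init_distributed(dist_backend=backend, auto_mpi_discovery=False)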

sword865 avatar Mar 21 '24 09:03 sword865

Hello @sword865,

I looked at this issue with the simplified repro example given by @matthewdeng. Yes, your investigation is correct, as is the suggestion to replace the torch.distributed.is_initialized() check with deepspeed.comm.is_initialized(), which is available in the minimum DeepSpeed version supported by Accelerate (0.9.3). It would be great if you could raise a PR with your suggested fix! Thank you!

pacman100 avatar Mar 21 '24 11:03 pacman100

Thank you, @pacman100. I have created a pull request with the fix. Could you please assist with the review?

sword865 avatar Mar 22 '24 06:03 sword865

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 15 '24 08:04 github-actions[bot]

Closing this issue since it is solved!

SunMarc avatar Apr 15 '24 09:04 SunMarc