
AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3

Open · 66RomanReigns opened this issue 1 year ago · 30 comments

I encountered an issue while using DeepSpeed with ZeRO Stage 3 optimization. I received the following error: no_sync is not compatible with ZeRO Stage 3. I’m not sure how to resolve this conflict.

If anyone has experience with this or knows how to resolve it, could you please guide me? Thank you in advance!

[rank0]: File "/root/miniconda3/envs/llama/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1997, in no_sync [rank0]: assert not self.zero_optimization_partition_gradients(),
[rank0]: AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3 0%| | 0/168 [00:00<?, ?it/s] W1126 23:28:07.821000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 402434 closing signal SIGTERM E1126 23:28:11.641000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 402435) of binary: /root/miniconda3/envs/llama/bin/python

66RomanReigns avatar Nov 26 '24 15:11 66RomanReigns

My guess was wrong; please see thehir0's reply.

WeiluXu avatar Nov 26 '24 19:11 WeiluXu

My code snippet:

    def _broadcast_to_vllm(self, model: DeepSpeedEngine):
        # avoid OOM
        torch.cuda.empty_cache()
        model = model.module
        count, num_params = 0, len(list(model.named_parameters()))
        for name, param in model.named_parameters():
            count += 1  # empty_cache at last param

            # Fire all vllm engines for broadcast
            if torch.distributed.get_rank() == 0:
                shape = param.shape if self.accelerator.deepspeed_plugin.zero_stage != 3 else param.ds_shape
                refs = [
                    engine.update_weight.remote(name, dtype=param.dtype, shape=shape, empty_cache=count == num_params)
                    for engine in self.vllm_engines
                ]

            # For ZeRO-3, allgather sharded parameter and broadcast to all vllm engines by rank 0
            with deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3):
                if torch.distributed.get_rank() == 0:
                    torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
                    ray.get(refs)

With deepspeed version 0.16.0 I get the same error at: deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3)

With deepspeed version 0.15.4:

_broadcast_to_vllm
    with deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3):
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2241, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=False)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1386, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1535, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1568, in _partition_param
    free_param(param)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 284, in free_param
    assert not param.ds_active_sub_modules, param.ds_summary()
AssertionError: {'id': 0, 'status': 'AVAILABLE', 'numel': 544997376, 'ds_numel': 544997376, 'shape': (152064, 3584), 'ds_shape': (152064, 3584), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {2}, 'ds_tensor.shape': torch.Size([34062336])}

Everything works if grad_accum = 1; if grad_accum > 1, these errors occur.

thehir0 avatar Nov 26 '24 23:11 thehir0

Using deepspeed==0.15.4 solves the problem.

LaoWangGB avatar Nov 27 '24 03:11 LaoWangGB

I faced the same error with deepspeed==0.16.0, but it seems to be fine with deepspeed==0.15.4

yejoon-lee avatar Nov 27 '24 06:11 yejoon-lee

Using deepspeed==0.15.4 solves the problem.

It works.

yuanyangeli avatar Nov 27 '24 08:11 yuanyangeli

I faced the same error with deepspeed==0.16.0, but it seems to be fine with deepspeed==0.15.4

Thank you, this is very helpful.

Luxanna-Real avatar Nov 27 '24 14:11 Luxanna-Real

Same issue in ZeRO-3 training; it is likely related to https://github.com/microsoft/DeepSpeed/pull/6675

inkcherry avatar Nov 28 '24 02:11 inkcherry

@66RomanReigns I think this issue should be re-opened; downgrading the version is not a long-term fix. And it's also a problem for ZeRO Stage 2.

samar-khanna avatar Dec 02 '24 21:12 samar-khanna

Same problem but with ZeRO stage 2. Solved by using deepspeed==0.15.4. Thx~

allblueee avatar Dec 03 '24 11:12 allblueee

Fixed this issue by setting gradient_accumulation_steps=1 while using deepspeed==0.16.0.
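For anyone unsure where that setting lives with the HF Trainer, here is a minimal sketch (the output dir, batch size, and config path are placeholders); as noted further down in the thread, this workaround can trade the error for higher memory use:

    from transformers import TrainingArguments

    # With gradient_accumulation_steps=1 every micro-step is an update
    # boundary, so the Trainer never wraps backward() in no_sync and the
    # assertion is never reached.
    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        deepspeed="ds_zero3_config.json",
    )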

dalan2014 avatar Dec 04 '24 06:12 dalan2014

@66RomanReigns, @allblueee, @inkcherry the reason for this assertion is that the no_sync context manager is meant to disable gradient reduction during the backward pass. However, this behavior conflicts with the gradient partitioning of ZeRO-2 and ZeRO-3, which requires gradient reduction. That is why we added the assertion as part of properly supporting the no_sync context manager.

Can you explain why you need the no_sync context manager in your code?
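To make the distinction concrete, here is a minimal sketch (not from this issue's code) contrasting the two accumulation patterns; the model, batch, and loss names are placeholders and assume an HF-style model that returns an object with a .loss attribute:

    import contextlib

    # DDP-style accumulation: no_sync() skips the gradient all-reduce on
    # non-boundary micro-steps; gradients are reduced only on the last one.
    def ddp_accumulation(ddp_model, micro_batches, optimizer, accum_steps):
        optimizer.zero_grad()
        for i, batch in enumerate(micro_batches):
            is_boundary = (i + 1) % accum_steps == 0
            ctx = contextlib.nullcontext() if is_boundary else ddp_model.no_sync()
            with ctx:
                loss = ddp_model(**batch).loss / accum_steps
                loss.backward()
            if is_boundary:
                optimizer.step()
                optimizer.zero_grad()

    # DeepSpeed-style accumulation: gradient_accumulation_steps lives in the
    # DeepSpeed config, and the engine reduces (and, under ZeRO-2/3, partitions)
    # gradients on the boundary it tracks internally, so no_sync is never needed.
    def deepspeed_accumulation(engine, micro_batches):
        for batch in micro_batches:
            loss = engine(**batch).loss
            engine.backward(loss)  # scales/accumulates according to the config
            engine.step()          # updates weights only on the boundary step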

tjruwase avatar Dec 04 '24 16:12 tjruwase

@thehir0, can you please open a separate ticket for your issue?

tjruwase avatar Dec 04 '24 16:12 tjruwase

Hi @tjruwase, I think the call to no_sync does not originate from the client code. As described in https://github.com/huggingface/transformers/issues/34984, no_sync is forced to be called by accelerate whenever the gradient accumulation step has not yet reached the boundary, as shown in this section of the code: https://github.com/huggingface/transformers/blob/052e652d6d53c2b26ffde87e039b723949a53493/src/transformers/trainer.py#L2474C75-L2482

However, in practice (in this case) DeepSpeed does not require this context call, because it has its own mechanism for reducing gradients and determining the gradient accumulation boundary.
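The fix on the transformers side is essentially to stop entering no_sync when the model is a DeepSpeed engine; the actual patch is the PR referenced later in this thread (https://github.com/huggingface/transformers/pull/35157). The following is only a rough sketch of the idea, with a hypothetical helper name:

    import contextlib
    from deepspeed.runtime.engine import DeepSpeedEngine

    def maybe_no_sync(model, at_accumulation_boundary: bool):
        # DeepSpeed tracks the accumulation boundary itself, so entering
        # no_sync would only trip the ZeRO-2/3 assertion; use a no-op
        # context instead. For plain DDP, keep the usual behavior of
        # deferring the gradient all-reduce until the boundary step.
        if isinstance(model, DeepSpeedEngine) or at_accumulation_boundary:
            return contextlib.nullcontext()
        return model.no_sync()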

inkcherry avatar Dec 05 '24 02:12 inkcherry

Downgrading to 0.15.4 worked for me, thanks all!

Using ZeRO-1 and the HF Trainer.

morganmcg1 avatar Dec 06 '24 19:12 morganmcg1

Downgrading also worked for me. With stage 1, I was getting the error AssertionError: It is illegal to call Engine.step() inside no_sync context manager.

dabs9 avatar Dec 09 '24 20:12 dabs9

+1, met this on deepspeed 0.16.1 with the HF Trainer.

rangehow avatar Dec 12 '24 11:12 rangehow

+1, met this on deepspeed 0.16.1 with the HF Trainer.

The same problem with ZeRO-3, the HF Trainer, and deepspeed 0.16.1. Solved by downgrading to deepspeed 0.15.4.

Kyle-Lyu avatar Dec 13 '24 01:12 Kyle-Lyu

I think this issue can be fixed by picking up https://github.com/huggingface/transformers/pull/35157

inkcherry avatar Dec 19 '24 11:12 inkcherry

Fixed this issue by setting gradient_accumulation_steps=1 while using deepspeed==0.16.0.

I tried the same thing at the cost of OOM; it's not a long-term fix.

chuangzhidan avatar Jan 16 '25 02:01 chuangzhidan

Downgrading worked for me too (on ZeRO-3).

RyanMarten avatar Jan 20 '25 23:01 RyanMarten

Please do NOT downgrade.

This is being fixed in the latest DeepSpeed.

lucasjinreal avatar Jan 26 '25 05:01 lucasjinreal

I got this error: AssertionError: It is illegal to call Engine.step() inside no_sync context manager, and downgrading to 0.15.4 worked for me too!

vishaal27 avatar Feb 04 '25 15:02 vishaal27

Solved by upgrading to 0.16.3.

Yangr116 avatar Feb 10 '25 09:02 Yangr116

As mentioned in https://github.com/huggingface/transformers/pull/35157, this can be solved by upgrading to transformers>=4.48.0.
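A quick way to confirm which combination you are running before deciding whether to upgrade transformers or pin deepspeed (assuming both packages are installed):

    from importlib.metadata import version

    # The thread's two workarounds: transformers>=4.48.0 (which stops calling
    # no_sync for DeepSpeed) or pinning deepspeed==0.15.4.
    print("transformers:", version("transformers"))
    print("deepspeed:", version("deepspeed"))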

Hypothesis-Z avatar Feb 14 '25 06:02 Hypothesis-Z

I guess it comes down to compatible versions between deepspeed and transformers; in my case, I have to use deepspeed==0.15.4 with transformers==4.37.2.

ruian1 avatar Feb 21 '25 19:02 ruian1

@Hypothesis-Z It seems that https://github.com/huggingface/transformers/pull/35157 still does not resolve the issue.

jianguoz avatar Mar 04 '25 19:03 jianguoz

Updated to deepspeed==0.16.5 and the error is still here:

[rank0]: File "/home/aiscuser/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2163, in no_sync [rank0]: assert not self.zero_optimization_partition_gradients(),
[rank0]: AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3

deepspeed==0.15.4 works well.

rootmq avatar Apr 10 '25 07:04 rootmq

deepspeed==0.16.4 still fails! deepspeed==0.15.4 works well.

Kairong-Han avatar May 13 '25 08:05 Kairong-Han

Faced this error with 0.16.7; not sure about the reason, and the error details are unclear. Moved back to 0.15.4.

doem97 avatar May 17 '25 08:05 doem97

Faced this error with 0.16.8; moving back to 0.15.4 works for me! @deepspeed team, please check this in the latest version.

allzero-kwon avatar May 22 '25 10:05 allzero-kwon