
AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3

Open · 66RomanReigns opened this issue 1 year ago · 30 comments

I encountered an issue while using DeepSpeed with ZeRO Stage 3 optimization. I received the following error: no_sync is not compatible with ZeRO Stage 3. I’m not sure how to resolve this conflict.

If anyone has experience with this or knows how to resolve it, could you please guide me? Thank you in advance!

[rank0]: File "/root/miniconda3/envs/llama/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1997, in no_sync [rank0]: assert not self.zero_optimization_partition_gradients(),
[rank0]: AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3 0%| | 0/168 [00:00<?, ?it/s] W1126 23:28:07.821000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 402434 closing signal SIGTERM E1126 23:28:11.641000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 402435) of binary: /root/miniconda3/envs/llama/bin/python

66RomanReigns avatar Nov 26 '24 15:11 66RomanReigns

My guess was wrong; please see thehir0's reply.

WeiluXu avatar Nov 26 '24 19:11 WeiluXu

My code snippet:

    def _broadcast_to_vllm(self, model: DeepSpeedEngine):
        # avoid OOM
        torch.cuda.empty_cache()
        model = model.module
        count, num_params = 0, len(list(model.named_parameters()))
        for name, param in model.named_parameters():
            count += 1  # empty_cache at last param

            # Fire all vllm engines for broadcast
            if torch.distributed.get_rank() == 0:
                shape = param.shape if self.accelerator.deepspeed_plugin.zero_stage != 3 else param.ds_shape
                refs = [
                    engine.update_weight.remote(name, dtype=param.dtype, shape=shape, empty_cache=count == num_params)
                    for engine in self.vllm_engines
                ]

            # For ZeRO-3, allgather sharded parameter and broadcast to all vllm engines by rank 0
            with deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3):
                if torch.distributed.get_rank() == 0:
                    torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
                    ray.get(refs)

With deepspeed version 0.16.0 I get the same error at: deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3)

With deepspeed version 0.15.4:

_broadcast_to_vllm
    with deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3):
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2241, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=False)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1386, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1535, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1568, in _partition_param
    free_param(param)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 284, in free_param
    assert not param.ds_active_sub_modules, param.ds_summary()
AssertionError: {'id': 0, 'status': 'AVAILABLE', 'numel': 544997376, 'ds_numel': 544997376, 'shape': (152064, 3584), 'ds_shape': (152064, 3584), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {2}, 'ds_tensor.shape': torch.Size([34062336])}

Everything works if grad_accum = 1; if grad_accum > 1, these errors occur.

thehir0 avatar Nov 26 '24 23:11 thehir0

Using deepspeed==0.15.4 solves the problem.

LaoWangGB avatar Nov 27 '24 03:11 LaoWangGB

I faced the same error with deepspeed==0.16.0, but it seems to be fine with deepspeed==0.15.4

yejoon-lee avatar Nov 27 '24 06:11 yejoon-lee

Using deepspeed==0.15.4 solves the problem.

It works.

yuanyangeli avatar Nov 27 '24 08:11 yuanyangeli

I faced the same error with deepspeed==0.16.0, but it seems to be fine with deepspeed==0.15.4

Thank you, this is very helpful.

Luxanna-Real avatar Nov 27 '24 14:11 Luxanna-Real

Same issue in ZeRO-3 training; it is likely related to https://github.com/microsoft/DeepSpeed/pull/6675

inkcherry avatar Nov 28 '24 02:11 inkcherry

@66RomanReigns I think this issue should be re-opened; downgrading the version is not a long-term fix. And it's also a problem for ZeRO Stage 2.

samar-khanna avatar Dec 02 '24 21:12 samar-khanna

Same problem but with ZeRO stage 2. Solved by using deepspeed==0.15.4. Thx~

allblueee avatar Dec 03 '24 11:12 allblueee

Fixed this issue by setting gradient_accumulation_steps=1 while using deepspeed==0.16.0.
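For anyone unsure where that setting lives with the HF Trainer, here is a minimal sketch (the output dir, batch size, and config path are placeholders); as noted further down in the thread, this workaround can trade the error for higher memory use:

    from transformers import TrainingArguments

    # With gradient_accumulation_steps=1 every micro-step is an update
    # boundary, so the Trainer never wraps backward() in no_sync and the
    # assertion is never reached.
    args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        deepspeed="ds_zero3_config.json",
    )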

dalan2014 avatar Dec 04 '24 06:12 dalan2014

@66RomanReigns, @allblueee, @inkcherry the reason for this assertion is that the no_sync context manager is meant to disable gradient reduction during the backward pass. However, this behavior conflicts with the gradient partitioning of ZeRO-2 and ZeRO-3, which requires gradient reduction. That is why we added the assertion as part of properly supporting the no_sync context manager.

Can you explain why you need the no_sync context manager in your code?
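To make the distinction concrete, here is a minimal sketch (not from this issue's code) contrasting the two accumulation patterns; the model, batch, and loss names are placeholders and assume an HF-style model that returns an object with a .loss attribute:

    import contextlib

    # DDP-style accumulation: no_sync() skips the gradient all-reduce on
    # non-boundary micro-steps; gradients are reduced only on the last one.
    def ddp_accumulation(ddp_model, micro_batches, optimizer, accum_steps):
        optimizer.zero_grad()
        for i, batch in enumerate(micro_batches):
            is_boundary = (i + 1) % accum_steps == 0
            ctx = contextlib.nullcontext() if is_boundary else ddp_model.no_sync()
            with ctx:
                loss = ddp_model(**batch).loss / accum_steps
                loss.backward()
            if is_boundary:
                optimizer.step()
                optimizer.zero_grad()

    # DeepSpeed-style accumulation: gradient_accumulation_steps lives in the
    # DeepSpeed config, and the engine reduces (and, under ZeRO-2/3, partitions)
    # gradients on the boundary it tracks internally, so no_sync is never needed.
    def deepspeed_accumulation(engine, micro_batches):
        for batch in micro_batches:
            loss = engine(**batch).loss
            engine.backward(loss)  # scales/accumulates according to the config
            engine.step()          # updates weights only on the boundary step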

tjruwase avatar Dec 04 '24 16:12 tjruwase

@thehir0, can you please open a separate ticket for your issue?

tjruwase avatar Dec 04 '24 16:12 tjruwase

Hi @tjruwase, I think the call to no_sync does not originate from the client code. As described in https://github.com/huggingface/transformers/issues/34984, no_sync is forced to be called by accelerate whenever the gradient accumulation step has not yet reached the boundary, as shown in this section of the code: https://github.com/huggingface/transformers/blob/052e652d6d53c2b26ffde87e039b723949a53493/src/transformers/trainer.py#L2474C75-L2482

However, in practice (in this case) DeepSpeed does not require this context call, because it has its own mechanism for reducing gradients and determining the gradient accumulation boundary.
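The fix on the transformers side is essentially to stop entering no_sync when the model is a DeepSpeed engine; the actual patch is the PR referenced later in this thread (https://github.com/huggingface/transformers/pull/35157). The following is only a rough sketch of the idea, with a hypothetical helper name:

    import contextlib
    from deepspeed.runtime.engine import DeepSpeedEngine

    def maybe_no_sync(model, at_accumulation_boundary: bool):
        # DeepSpeed tracks the accumulation boundary itself, so entering
        # no_sync would only trip the ZeRO-2/3 assertion; use a no-op
        # context instead. For plain DDP, keep the usual behavior of
        # deferring the gradient all-reduce until the boundary step.
        if isinstance(model, DeepSpeedEngine) or at_accumulation_boundary:
            return contextlib.nullcontext()
        return model.no_sync()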

inkcherry avatar Dec 05 '24 02:12 inkcherry

Downgrading to 0.15.4 worked for me, thanks all!

Using ZeRO-1 and the HF Trainer.

morganmcg1 avatar Dec 06 '24 19:12 morganmcg1

Downgrading also worked for me. With stage 1, I was getting the error AssertionError: It is illegal to call Engine.step() inside no_sync context manager.

dabs9 avatar Dec 09 '24 20:12 dabs9

+1, met this on deepspeed 0.16.1 with the HF Trainer.

rangehow avatar Dec 12 '24 11:12 rangehow

+1, met this on deepspeed 0.16.1 with the HF Trainer.

The same problem with ZeRO-3, the HF Trainer, and deepspeed 0.16.1. Solved by downgrading to deepspeed 0.15.4.

Kyle-Lyu avatar Dec 13 '24 01:12 Kyle-Lyu

I think this issue can be fixed by picking up https://github.com/huggingface/transformers/pull/35157

inkcherry avatar Dec 19 '24 11:12 inkcherry

Fixed this issue by setting gradient_accumulation_steps=1 while using deepspeed==0.16.0.

I tried the same thing at the cost of OOM; it's not a long-term fix.

chuangzhidan avatar Jan 16 '25 02:01 chuangzhidan

Downgrading worked for me too (on ZeRO-3).

RyanMarten avatar Jan 20 '25 23:01 RyanMarten

Please do NOT downgrade.

This is being fixed in the latest DeepSpeed.

lucasjinreal avatar Jan 26 '25 05:01 lucasjinreal

I got this error: AssertionError: It is illegal to call Engine.step() inside no_sync context manager, and downgrading to 0.15.4 worked for me too!

vishaal27 avatar Feb 04 '25 15:02 vishaal27

Solved by upgrading to 0.16.3.

Yangr116 avatar Feb 10 '25 09:02 Yangr116

As mentioned in https://github.com/huggingface/transformers/pull/35157, this can be solved by upgrading to transformers>=4.48.0.
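A quick way to confirm which combination you are running before deciding whether to upgrade transformers or pin deepspeed (assuming both packages are installed):

    from importlib.metadata import version

    # The thread's two workarounds: transformers>=4.48.0 (which stops calling
    # no_sync for DeepSpeed) or pinning deepspeed==0.15.4.
    print("transformers:", version("transformers"))
    print("deepspeed:", version("deepspeed"))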

Hypothesis-Z avatar Feb 14 '25 06:02 Hypothesis-Z

I guess it comes down to compatible versions between deepspeed and transformers; in my case, I have to use deepspeed==0.15.4 with transformers==4.37.2.

ruian1 avatar Feb 21 '25 19:02 ruian1

@Hypothesis-Z It seems that https://github.com/huggingface/transformers/pull/35157 still does not resolve the issue.

jianguoz avatar Mar 04 '25 19:03 jianguoz

Updated to deepspeed==0.16.5 and the error is still here:

[rank0]: File "/home/aiscuser/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2163, in no_sync [rank0]: assert not self.zero_optimization_partition_gradients(),
[rank0]: AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3

deepspeed==0.15.4 works well.

rootmq avatar Apr 10 '25 07:04 rootmq

deepspeed==0.16.4 still fails! deepspeed==0.15.4 works well.

Kairong-Han avatar May 13 '25 08:05 Kairong-Han

Faced this error with 0.16.7; not sure about the reason, and the error details are unclear. Moved back to 0.15.4.

doem97 avatar May 17 '25 08:05 doem97

Faced this error with 0.16.8; moving back to 0.15.4 works for me! @deepspeed team, please check this in the latest version.

allzero-kwon avatar May 22 '25 10:05 allzero-kwon