AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
I encountered an issue while using DeepSpeed with ZeRO Stage 3 optimization. I received the following error: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3. I'm not sure how to resolve this conflict.
If anyone has experience with this or knows how to resolve it, could you please guide me? Thank you in advance!
[rank0]: File "/root/miniconda3/envs/llama/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1997, in no_sync
[rank0]: assert not self.zero_optimization_partition_gradients(),
[rank0]: AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
0%| | 0/168 [00:00<?, ?it/s]
W1126 23:28:07.821000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 402434 closing signal SIGTERM
E1126 23:28:11.641000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 402435) of binary: /root/miniconda3/envs/llama/bin/python
My guess was wrong; please see thehir0's reply.
My code snippet:
def _broadcast_to_vllm(self, model: DeepSpeedEngine):
    # avoid OOM
    torch.cuda.empty_cache()
    model = model.module
    count, num_params = 0, len(list(model.named_parameters()))
    for name, param in model.named_parameters():
        count += 1  # empty_cache at last param

        # Fire all vllm engines for broadcast
        if torch.distributed.get_rank() == 0:
            shape = param.shape if self.accelerator.deepspeed_plugin.zero_stage != 3 else param.ds_shape
            refs = [
                engine.update_weight.remote(name, dtype=param.dtype, shape=shape, empty_cache=count == num_params)
                for engine in self.vllm_engines
            ]

        # For ZeRO-3, allgather sharded parameter and broadcast to all vllm engines by rank 0
        with deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3):
            if torch.distributed.get_rank() == 0:
                torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
                ray.get(refs)
With deepspeed version 0.16.0 I get the same error at: deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3)
With deepspeed version 0.15.4 I get the following traceback:
_broadcast_to_vllm
with deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3):
File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2241, in __exit__
self.params[0].partition(param_list=self.params, has_been_updated=False)
File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1386, in partition
self._partition(param_list, has_been_updated=has_been_updated)
File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1535, in _partition
self._partition_param(param, has_been_updated=has_been_updated)
File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1568, in _partition_param
free_param(param)
File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 284, in free_param
assert not param.ds_active_sub_modules, param.ds_summary()
AssertionError: {'id': 0, 'status': 'AVAILABLE', 'numel': 544997376, 'ds_numel': 544997376, 'shape': (152064, 3584), 'ds_shape': (152064, 3584), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {2}, 'ds_tensor.shape': torch.Size([34062336])}
Everything works if grad_accum = 1; if grad_accum > 1, these errors occur.
Using deepspeed==0.15.4 solved the problem.
I faced the same error with deepspeed==0.16.0, but it seems to be fine with deepspeed==0.15.4
It works.
Thank you, this is very helpful.
Same issue in ZeRO-3 training; it is likely related to https://github.com/microsoft/DeepSpeed/pull/6675
@66RomanReigns I think this issue should be re-opened: downgrading the version is not a long-term fix, and it's also a problem for ZeRO Stage 2.
Same problem but with ZeRO Stage 2. Solved by using deepspeed==0.15.4. Thanks!
Fixed this issue by setting gradient_accumulation_steps=1 while using deepspeed==0.16.0.
@66RomanReigns, @allblueee, @inkcherry the reason for this assertion is that the no_sync context manager is meant to disable gradient reduction during the backward pass. However, this behavior conflicts with the gradient partitioning of ZeRO-2 and ZeRO-3, which requires gradient reduction. That is why we added the assertion, to properly support the no_sync context manager.
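Conceptually, the check amounts to something like the minimal sketch below (TinyEngineSketch is a made-up name, not the real DeepSpeedEngine code); it raises the same AssertionError whenever gradients are partitioned:

from contextlib import contextmanager

class TinyEngineSketch:
    """Hypothetical stand-in for DeepSpeedEngine, for illustration only."""

    def __init__(self, zero_stage: int):
        self.zero_stage = zero_stage

    def zero_optimization_partition_gradients(self) -> bool:
        # ZeRO stage >= 2 partitions gradients across ranks, so the backward pass
        # must reduce-scatter them; skipping reduction would leave the shards stale.
        return self.zero_stage >= 2

    @contextmanager
    def no_sync(self):
        # Mirrors the assertion from the traceback above.
        assert not self.zero_optimization_partition_gradients(), \
            "no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3"
        yield

# Stage 0/1 engines accept no_sync; stage 2/3 engines raise the AssertionError.
with TinyEngineSketch(zero_stage=1).no_sync():
    pass
try:
    with TinyEngineSketch(zero_stage=3).no_sync():
        pass
except AssertionError as e:
    print("raised as expected:", e)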
Can you explain why you need the no_sync context manager in your code?
@thehir0, can you please open a separate ticket for your issue?
Hi @tjruwase, I think the call to no_sync does not originate from the client code. As described in https://github.com/huggingface/transformers/issues/34984, it seems that no_sync is forced to be called by accelerate when the gradient accumulation step has not yet reached the boundary, as shown in this section of the code: https://github.com/huggingface/transformers/blob/052e652d6d53c2b26ffde87e039b723949a53493/src/transformers/trainer.py#L2474C75-L2482
However, in practice (in this case), DeepSpeed does not require this context call because it has its own mechanism for reducing grads and determining the gradient accumulation boundary.
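To make that call path concrete, here is a rough sketch of the accumulation pattern (placeholder names like model, accelerator, batch; simplified, not the actual transformers/accelerate code) and of the workaround of skipping no_sync entirely when DeepSpeed is in charge of accumulation:

import contextlib

def training_step(accelerator, model, batch, at_accumulation_boundary: bool, uses_deepspeed: bool):
    # Accelerate-style trainers wrap backward in model.no_sync() on non-boundary steps
    # to skip the gradient all-reduce. A DeepSpeedEngine rejects this (the assertion above),
    # because DeepSpeed tracks the accumulation boundary and reduces gradients itself,
    # so the workaround is, roughly, to never enter no_sync when DeepSpeed is used.
    if at_accumulation_boundary or uses_deepspeed:
        ctx = contextlib.nullcontext()
    else:
        ctx = model.no_sync()
    with ctx:
        loss = model(**batch).loss
        accelerator.backward(loss)
    return loss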
Downgrading to 0.15.4 worked for me, thanks all!
Using ZeRO-1 and the HF Trainer:
Downgrading also worked for me. I was getting the error AssertionError: It is illegal to call Engine.step() inside no_sync context manager with stage 1.
+1, met this on deepspeed 0.16.1 with hf trainer
The same problem with ZeRO-3, the HF Trainer, and deepspeed 0.16.1. Solved by downgrading to deepspeed 0.15.4.
I think this issue can be fixed by picking up https://github.com/huggingface/transformers/pull/35157
Fixed this issue by setting gradient_accumulation_steps=1 while using deepspeed==0.16.0.
I tried the same thing, at the cost of OOM; it's not a long-term fix.
Downgrading worked for me too (on ZeRO-3).
Please DO NOT downgrade. We are fixing this in the latest DeepSpeed.
I got the error AssertionError: It is illegal to call Engine.step() inside no_sync context manager, and downgrading to 0.15.4 worked for me too!
Solved by upgrading to 0.16.3.
As mentioned in https://github.com/huggingface/transformers/pull/35157, this can be solved by upgrading to transformers>=4.48.0.
I guess it comes down to compatible versions between transformers and deepspeed; in my case, I have to use deepspeed==0.15.4 with transformers==4.37.2.
@Hypothesis-Z It seems that https://github.com/huggingface/transformers/pull/35157 still does not resolve the issue.
I updated to deepspeed==0.16.5 and the error is still here:
[rank0]: File "/home/aiscuser/.local/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2163, in no_sync
[rank0]: assert not self.zero_optimization_partition_gradients(),
[rank0]: AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
deepspeed==0.15.4 works well.
deepspeed==0.16.4 is still broken! deepspeed==0.15.4 works well.
Faced this error with 0.16.7; not sure about the reason, and the error details are unclear. Moved back to 0.15.4.
Faced this error with 0.16.8; moving back to 0.15.4 works for me! @deepspeed team, please check this in the latest version.