[BUG] `reduce_bucket_size` isn't validated against the model size

Open stas00 opened this issue 1 year ago • 0 comments

Describe the bug

When the model is small and `reduce_bucket_size` is larger than the model, the backward pass fails with a message-less assertion:

  File "/mnt/nvme0/code/huggingface/accelerate-master/src/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1981, in backward
    self.allreduce_gradients()
  File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1902, in allreduce_gradients
    self.optimizer.overlapping_partition_gradients_reduce_epilogue()
  File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1098, in overlapping_partition_gradients_reduce_epilogue
    self.independent_gradient_partition_epilogue()
  File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1075, in independent_gradient_partition_epilogue
    self.__reduce_and_partition_ipg_grads()
  File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1188, in __reduce_and_partition_ipg_grads
    assert len(set(p.ds_id for p in self.params_in_ipg_bucket)) == len(self.params_in_ipg_bucket)
AssertionError

  1. It could probably check at launch time that the bucket (`reduce_bucket_size`) isn't larger than the model size.
  2. In any case, the assert needs to tell the user what's wrong and how they can fix the problem. The current message-less assert doesn't do that.
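
To make the two suggestions above concrete, here is a rough sketch of what I have in mind. It is not DeepSpeed code: `validate_reduce_bucket_size` and `check_unique_ds_ids` are hypothetical helper names, and the wording of the messages assumes the diagnosis in this report (bucket larger than the model) is the actual cause of the duplicate `ds_id`s.

```python
from collections import Counter
from typing import Iterable


def validate_reduce_bucket_size(reduce_bucket_size: int, model_numel: int) -> None:
    """Launch-time check (hypothetical helper, not an existing DeepSpeed API).

    Both arguments are counted in elements, which is how the
    `reduce_bucket_size` config value is expressed.
    """
    if reduce_bucket_size > model_numel:
        raise ValueError(
            f"reduce_bucket_size={reduce_bucket_size:,} is larger than the model "
            f"({model_numel:,} elements). Set zero_optimization.reduce_bucket_size "
            f"to at most {model_numel:,}."
        )


def check_unique_ds_ids(ds_ids: Iterable[int]) -> None:
    """Same condition as the assert in the traceback, but with a message.

    `ds_ids` stands in for `p.ds_id for p in self.params_in_ipg_bucket`.
    """
    duplicates = sorted(i for i, n in Counter(ds_ids).items() if n > 1)
    assert not duplicates, (
        f"params with ds_id {duplicates} were added to the IPG bucket more than once. "
        "This can happen when reduce_bucket_size is larger than the model size; "
        "try lowering zero_optimization.reduce_bucket_size."
    )
```

The first helper would run once when the config is parsed, and the second would replace the bare `assert` so that anyone who still hits it gets a pointer to the setting to change.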

Thanks.

stas00, Dec 26 '23 23:12