DeepSpeed
[BUG] `reduce_bucket_size` isn't validated against the model size
Describe the bug
When a model is small and `reduce_bucket_size` is larger than the model itself, the following happens:
File "/mnt/nvme0/code/huggingface/accelerate-master/src/accelerate/utils/deepspeed.py", line 167, in backward
self.engine.backward(loss, **kwargs)
File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1981, in backward
self.allreduce_gradients()
File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1902, in allreduce_gradients
self.optimizer.overlapping_partition_gradients_reduce_epilogue()
File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1098, in overlapping_partition_gradients_reduce_epilogue
self.independent_gradient_partition_epilogue()
File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1075, in independent_gradient_partition_epilogue
self.__reduce_and_partition_ipg_grads()
File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/stas/anaconda3/envs/py39-pt21/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1188, in __reduce_and_partition_ipg_grads
assert len(set(p.ds_id for p in self.params_in_ipg_bucket)) == len(self.params_in_ipg_bucket)
AssertionError
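For context, here is a minimal sketch of the kind of setup that hits this mismatch. It is illustrative only, not the exact failing script: the tiny model, sizes, and optimizer choice are assumptions, and the point is just that `reduce_bucket_size` vastly exceeds the model's parameter count.

```python
# Illustrative sketch only (assumed setup, not the original repro).
# Run with the DeepSpeed launcher, e.g.: deepspeed --num_gpus=1 repro.py
import torch
import deepspeed

# A tiny model: far fewer parameters than the configured reduce_bucket_size.
model = torch.nn.Linear(16, 16)  # ~272 parameters

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {
        "stage": 3,
        # Much larger than the whole model -- the mismatch described above.
        "reduce_bucket_size": 500_000_000,
    },
}

# Standard DeepSpeed entry point; the message-less assert in stage3.py then
# fires later, during engine.backward(loss).
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```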
- DeepSpeed could probably check at launch time that the bucket isn't larger than the model size (a hedged sketch of both suggestions follows below)
- in any case, the assert needs to tell the user what is wrong and how to fix the problem; the current message-less assert isn't doing that.
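To make these concrete, here is a rough sketch of what the two suggestions might look like. The helper name is hypothetical (no such function exists in DeepSpeed today), and the assert rewrite only illustrates attaching an actionable message to the existing check in `__reduce_and_partition_ipg_grads`; the actual fix and wording are up to the maintainers.

```python
def validate_reduce_bucket_size(reduce_bucket_size, model_parameters):
    """Hypothetical launch-time check: fail fast (or clamp) when the
    configured bucket is larger than the entire model."""
    num_params = sum(p.numel() for p in model_parameters)
    if reduce_bucket_size > num_params:
        raise ValueError(
            f"zero_optimization.reduce_bucket_size={reduce_bucket_size} exceeds the "
            f"total number of model parameters ({num_params}); lower "
            "reduce_bucket_size to at most the model size."
        )

# Inside stage3.py, instead of the message-less assert shown in the traceback:
ds_ids = [p.ds_id for p in self.params_in_ipg_bucket]
assert len(set(ds_ids)) == len(ds_ids), (
    f"Duplicate parameters found in the IPG bucket (ds_ids={ds_ids}). This can "
    "happen when zero_optimization.reduce_bucket_size is larger than the model; "
    "try lowering reduce_bucket_size."
)
```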
Thanks.