Joe Mayer
Adding a configurable gradient accumulation data type to ZeRO stages 1 and 2. Two gradient attributes were added to the parameter object: 1. `param.grad_reduc`: used only as a pointer to reference which...
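A minimal sketch of the mechanism, assuming a helper of roughly this shape (the function name and the `param.grad_accum` attribute are illustrative, not the PR's actual code):

```python
import torch

def accumulate_grad(param: torch.nn.Parameter, accum_dtype: torch.dtype):
    """Run after each micro-batch backward pass (illustrative only)."""
    if accum_dtype == param.dtype:
        # No separate buffer: the reduction path follows .grad directly.
        param.grad_reduc = param.grad
        return
    if getattr(param, "grad_accum", None) is None:
        # Persistent accumulation buffer in the configured dtype.
        param.grad_accum = torch.zeros_like(param, dtype=accum_dtype)
    param.grad_accum.add_(param.grad.to(accum_dtype))  # up-cast and accumulate
    param.grad_reduc = param.grad_accum  # pointer to the tensor to reduce
    param.grad = None  # free the low-precision grad between micro-batches
```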
Addressing issue #2994
Adding the PyTorch Adagrad optimizer to the list of supported ZeRO optimizers.
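A minimal usage sketch (batch size and learning rate are placeholders); per the description above, `"Adagrad"` resolves to `torch.optim.Adagrad` and is wrapped by the ZeRO stage 1/2 optimizer:

```python
import torch
import deepspeed

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adagrad", "params": {"lr": 1e-2}},
    "zero_optimization": {"stage": 1},
}

model = torch.nn.Linear(16, 4)
# DeepSpeed builds the Adagrad optimizer from the config and wraps it
# with the ZeRO stage 1 optimizer.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```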
[Issue 3367](https://github.com/microsoft/DeepSpeed/issues/3367)
Apex added bf16 support to FusedAdam; this syncs DeepSpeed's FusedAdam to support bf16 as well. Based on [this](https://github.com/microsoft/DeepSpeed/issues/3006) request. Fixes #3006
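A short sketch of what the sync enables (assumes a CUDA device and a build with the bf16-capable fused kernel):

```python
import torch
from deepspeed.ops.adam import FusedAdam

# FusedAdam stepping a bf16 model directly.
model = torch.nn.Linear(16, 4).to(device="cuda", dtype=torch.bfloat16)
optimizer = FusedAdam(model.parameters(), lr=1e-3)

loss = model(torch.randn(2, 16, device="cuda", dtype=torch.bfloat16)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```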
Helps address this issue: https://github.com/microsoft/DeepSpeedExamples/issues/525
How much more GPU memory do the TE layers consume vs. a standard PyTorch layer? I'm seeing close to double the consumption for the opt-1.3b model when compared to standard...
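One rough way to measure the gap, assuming Transformer Engine's `te.Linear` (layer sizes are placeholders, and fp8 autocast is omitted for simplicity):

```python
import torch
import torch.nn as nn
import transformer_engine.pytorch as te

def peak_mib(make_layer):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    layer = make_layer().cuda()
    x = torch.randn(8, 4096, device="cuda", requires_grad=True)
    layer(x).sum().backward()  # forward + backward, so saved activations count
    return torch.cuda.max_memory_allocated() / 2**20

print("nn.Linear peak MiB:", peak_mib(lambda: nn.Linear(4096, 4096)))
print("te.Linear peak MiB:", peak_mib(lambda: te.Linear(4096, 4096)))
```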
Does every tensor used in TE need to have `requires_grad = True`? I needed to add a dummy tensor for compatibility purposes to get activation checkpointing to work for...
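A sketch of the workaround described above (the `dummy` tensor and `block` function are illustrative): the reentrant `torch.utils.checkpoint` path only builds a backward graph if the checkpointed output requires grad, so a throwaway grad-requiring tensor is threaded through the function.

```python
import torch
from torch.utils.checkpoint import checkpoint

dummy = torch.zeros((), requires_grad=True)

def block(x, dummy):
    # The TE layers would run here; `dummy` participates trivially so the
    # checkpointed output requires grad even when `x` does not.
    return x * 2 + dummy

x = torch.randn(4, 8)  # e.g. inputs that do not require grad
y = checkpoint(block, x, dummy, use_reentrant=True)
y.sum().backward()
```

With `use_reentrant=False`, this dummy is generally unnecessary.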
### System Info

```Shell
pip install accelerate.
```

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] One...
`num_bytes_per_thread` was a smaller type than `file_num_bytes`, which caused issues when dividing the latter by `num_threads`.
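A Python stand-in for the C++ integer types involved, illustrating the failure mode (all constants are made up):

```python
import ctypes

file_num_bytes = 12 * 1024**3               # 12 GiB: needs a 64-bit type
num_threads = 4
per_thread = file_num_bytes // num_threads  # 3 GiB, above INT32_MAX

# If num_bytes_per_thread is a narrower type than file_num_bytes,
# the quotient silently wraps:
print(ctypes.c_int32(per_thread).value)  # -1073741824 (wrapped)
print(ctypes.c_int64(per_thread).value)  # 3221225472 (correct)
```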