Joe Mayer

Results: 10 issues by Joe Mayer

Adds a configurable gradient-accumulation data type to ZeRO stages 1 and 2. Adds two gradient attributes to the parameter object: 1. `param.grad_reduc`: used only as a pointer to reference which...
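The snippet above is about making the gradient-accumulation data type configurable. As a minimal sketch of why the accumulation dtype matters (not DeepSpeed's implementation — the gradient value and step count here are invented, and fp16 is emulated with `struct`'s half-precision format):

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip through IEEE half precision to emulate a 16-bit buffer
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-4    # a small per-micro-batch gradient (illustrative)
steps = 1024   # gradient accumulation steps (illustrative)

# Accumulating in fp16: every addition is rounded to half precision
acc16 = 0.0
for _ in range(steps):
    acc16 = to_fp16(acc16 + to_fp16(grad))

# Accumulating in a wider dtype: no per-step rounding loss
acc32 = sum(to_fp16(grad) for _ in range(steps))

print(acc16, acc32)  # the two accumulators drift apart
```

Keeping the accumulation buffer in a wider dtype (e.g. fp32) avoids this per-step rounding error even when the gradients themselves are fp16/bf16.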

Adds the PyTorch Adagrad optimizer to the list of supported ZeRO optimizers.

[Issue 3367](https://github.com/microsoft/DeepSpeed/issues/3367)

Apex added bf16 support to FusedAdam; this syncs DeepSpeed's FusedAdam to support bf16 as well. Based on [this](https://github.com/microsoft/DeepSpeed/issues/3006) request. Fix #3006

Helps address this issue: https://github.com/microsoft/DeepSpeedExamples/issues/525

How much more GPU memory will the TE layers consume vs. a standard PyTorch layer? I'm seeing close to double the consumption for the opt-1.3b model when compared to standard...

Does every tensor used in TE need to have `requires_grad = True`? I needed to add a dummy tensor for compatibility purposes to get activation checkpointing to work for...


### System Info
```Shell
pip install accelerate.
```
### Information
- [ ] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] One...

`num_bytes_per_thread` was declared as a narrower type than `file_num_bytes`, so storing the result of dividing `file_num_bytes` by `num_threads` into it could silently truncate for large files.
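A minimal sketch of that failure mode, with invented sizes, emulating a narrow 32-bit `num_bytes_per_thread` via `ctypes` (the actual types in the fix may differ):

```python
import ctypes

file_num_bytes = 32 * 1024**3   # 32 GiB file: the count needs 64 bits
num_threads = 8

# Correct: keep the per-thread byte count in a wide (64-bit) type
wide = file_num_bytes // num_threads

# Bug: storing the quotient in a 32-bit type silently truncates it
narrow = ctypes.c_int32(file_num_bytes // num_threads).value

print(wide, narrow)  # the 32-bit value wraps and no longer matches
```

The fix is simply to give `num_bytes_per_thread` the same wide type as `file_num_bytes` so the division result always fits.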