Joe Mayer
Adding a configurable gradient accumulation data type to ZeRO stages 1 and 2. Two gradient attributes were added to the parameter object: 1. `param.grad_reduc`: used only as a pointer to reference which...
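A minimal sketch of the mechanism, assuming a helper of roughly this shape (the function name and the `param.grad_accum` attribute are illustrative, not the PR's actual code):

```python
import torch

def accumulate_grad(param: torch.nn.Parameter, accum_dtype: torch.dtype):
    """Run after each micro-batch backward pass (illustrative only)."""
    if accum_dtype == param.dtype:
        # No separate buffer: the reduction path follows .grad directly.
        param.grad_reduc = param.grad
        return
    if getattr(param, "grad_accum", None) is None:
        # Persistent accumulation buffer in the configured dtype.
        param.grad_accum = torch.zeros_like(param, dtype=accum_dtype)
    param.grad_accum.add_(param.grad.to(accum_dtype))  # up-cast and accumulate
    param.grad_reduc = param.grad_accum  # pointer to the tensor to reduce
    param.grad = None  # free the low-precision grad between micro-batches
```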
Addressing issue #2994
Adding the PyTorch Adagrad optimizer to the list of supported ZeRO optimizers.
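A minimal usage sketch (batch size and learning rate are placeholders); per the description above, `"Adagrad"` resolves to `torch.optim.Adagrad` and is wrapped by the ZeRO stage 1/2 optimizer:

```python
import torch
import deepspeed

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adagrad", "params": {"lr": 1e-2}},
    "zero_optimization": {"stage": 1},
}

model = torch.nn.Linear(16, 4)
# DeepSpeed builds the Adagrad optimizer from the config and wraps it
# with the ZeRO stage 1 optimizer.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```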
[Issue 3367](https://github.com/microsoft/DeepSpeed/issues/3367)
Apex added bf16 support to FusedAdam; this syncs DeepSpeed's FusedAdam to support bf16 as well. Based on [this](https://github.com/microsoft/DeepSpeed/issues/3006) request. Fixes #3006
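A short sketch of what the sync enables (assumes a CUDA device and a build with the bf16-capable fused kernel):

```python
import torch
from deepspeed.ops.adam import FusedAdam

# FusedAdam stepping a bf16 model directly.
model = torch.nn.Linear(16, 4).to(device="cuda", dtype=torch.bfloat16)
optimizer = FusedAdam(model.parameters(), lr=1e-3)

loss = model(torch.randn(2, 16, device="cuda", dtype=torch.bfloat16)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```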
Helps address this issue: https://github.com/microsoft/DeepSpeedExamples/issues/525
How much more GPU memory do the TE layers consume vs. a standard PyTorch layer? I'm seeing close to double the consumption for the opt-1.3b model when compared to standard...
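One rough way to measure the gap, assuming Transformer Engine's `te.Linear` (layer sizes are placeholders, and fp8 autocast is omitted for simplicity):

```python
import torch
import torch.nn as nn
import transformer_engine.pytorch as te

def peak_mib(make_layer):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    layer = make_layer().cuda()
    x = torch.randn(8, 4096, device="cuda", requires_grad=True)
    layer(x).sum().backward()  # forward + backward, so saved activations count
    return torch.cuda.max_memory_allocated() / 2**20

print("nn.Linear peak MiB:", peak_mib(lambda: nn.Linear(4096, 4096)))
print("te.Linear peak MiB:", peak_mib(lambda: te.Linear(4096, 4096)))
```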
Does every tensor used in TE need to have `requires_grad = True`? I needed to add a dummy tensor for compatibility purposes to get activation checkpointing to work for...
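A sketch of the workaround described above (the `dummy` tensor and `block` function are illustrative): the reentrant `torch.utils.checkpoint` path only builds a backward graph if the checkpointed output requires grad, so a throwaway grad-requiring tensor is threaded through the function.

```python
import torch
from torch.utils.checkpoint import checkpoint

dummy = torch.zeros((), requires_grad=True)

def block(x, dummy):
    # The TE layers would run here; `dummy` participates trivially so the
    # checkpointed output requires grad even when `x` does not.
    return x * 2 + dummy

x = torch.randn(4, 8)  # e.g. inputs that do not require grad
y = checkpoint(block, x, dummy, use_reentrant=True)
y.sum().backward()
```

With `use_reentrant=False`, this dummy is generally unnecessary.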
### System Info

```Shell
pip install accelerate.
```

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] One...
`num_bytes_per_thread` was a smaller type than `file_num_bytes`, which caused issues when dividing the latter by `num_threads`.
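A Python stand-in for the C++ integer types involved, illustrating the failure mode (all constants are made up):

```python
import ctypes

file_num_bytes = 12 * 1024**3               # 12 GiB: needs a 64-bit type
num_threads = 4
per_thread = file_num_bytes // num_threads  # 3 GiB, above INT32_MAX

# If num_bytes_per_thread is a narrower type than file_num_bytes,
# the quotient silently wraps:
print(ctypes.c_int32(per_thread).value)  # -1073741824 (wrapped)
print(ctypes.c_int64(per_thread).value)  # 3221225472 (correct)
```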