
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Results: 1333 DeepSpeed issues, sorted by recently updated

I failed to reproduce the example from the [deepspeed tutorials](https://www.deepspeed.ai/tutorials/zero-offload/) with Hugging Face transformers. The main problem is that I need memory of at least 3x the parameter size, and it would be...

**Describe the bug** As the title says: when `overlap_comm` and `contiguous_gradients` are enabled together, `grad_norm` becomes NaN (or a constant float value on the latest master code, with this PR: https://github.com/deepspeedai/DeepSpeed/pull/7171), ...

bug
training
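For reference, a minimal ZeRO stage 2 configuration that combines the two flags from the report above might look like the following. This is a sketch, not the reporter's actual config: the field names follow the DeepSpeed config schema, but batch size and precision settings here are illustrative.

```python
import json

# Minimal ZeRO stage 2 config combining the two reported options.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,         # overlap gradient reduction with backward
        "contiguous_gradients": True  # copy grads into one contiguous buffer
    },
    "fp16": {"enabled": True},
}

print(json.dumps(ds_config, indent=2))
```

Such a dict can be passed to `deepspeed.initialize(..., config=ds_config)` or saved as a JSON file for the `--deepspeed_config` flag.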

**Describe the bug** I am using ds-0.18.0 with the launcher set to openmpi, and I get an error about MPI environment variables. **To Reproduce** Run a command like the following: ``` deepspeed \ --hostfile=${HOSTFILE_PATH}...

bug
training

`ds_secondary_tensor` may be dirty during model loading or ZeRO checkpointing for ZeRO++. 1. Loading the model: my task is transformers SFT. In the transformers code, initialization is done using code...

Compiled Autograd is an extension to torch.compile which enhances the autograd engine by capturing a larger backward computation graph at runtime. This allows a more comprehensive optimization of the backward...

Relaxing the tolerance values to enable the two unit tests below, with FP16 and BF16 data types on ROCm: ``` unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[bf16] unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[fp32] ```

The DeepSpeed optimizer always creates fp32 master params/gradients/optimizer states. However, we sometimes want to keep them in lower precision, given [torch.autocast support](https://deepspeed.readthedocs.io/en/latest/training.html#mixed-precision-training). This PR allows lower-precision master params/grads/optimizer states with bf16/fp16...
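The `torch.autocast` pattern the PR builds on is standard PyTorch: the forward pass runs in a lower precision where safe while parameters stay fp32. A minimal CPU sketch (the model and shapes are illustrative):

```python
import torch

model = torch.nn.Linear(8, 8)   # parameters are fp32
x = torch.randn(4, 8)

# Under autocast, linear layers run in bf16 while the fp32 master
# parameters are untouched; the PR's point is allowing the master
# copies and optimizer states themselves to stay lower precision.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)          # torch.bfloat16
print(model.weight.dtype)  # torch.float32
```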

The patch delivers several fixes for build issues in the CUDA part of the DeepSpeed library. The percentage of passing unit tests improved (tested on RDNA hardware, gfx110x and gfx12x). Before: collected 5298 items...

**Describe the bug** It crashes on the first backward pass during training when I use DeepSpeed ZeRO-2. **Information** Here is the traceback: ``` Training: 0it [00:00, ?it/s] Training: 0%| | 0/40320 [00:00

bug
training

I have observed CPU memory requirements of about `numel * 28 bytes` for ZeRO-2 with optimizer CPU offload. Here is a mapping of the current CPU memory allocations when `offload_optimizer.device: cpu` is...
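As a back-of-envelope check of that figure: taking the reported `numel * 28 bytes` at face value (the 28-byte constant comes from the issue, not from DeepSpeed documentation), the CPU footprint for a few common model sizes is just arithmetic:

```python
# Reported CPU footprint for ZeRO-2 + optimizer CPU offload:
# roughly numel * 28 bytes (figure taken from the issue report).
BYTES_PER_PARAM = 28

def cpu_offload_gib(numel: int) -> float:
    """Estimated CPU memory in GiB for `numel` parameters."""
    return numel * BYTES_PER_PARAM / 2**30

for name, numel in [("1.3B", 1_300_000_000), ("7B", 7_000_000_000)]:
    print(f"{name}: ~{cpu_offload_gib(numel):.1f} GiB")
# 1.3B: ~33.9 GiB
# 7B: ~182.5 GiB
```

So a 7B-parameter model would need roughly 180 GiB of host RAM under this observation, which makes the per-parameter accounting in the issue worth pinning down.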