
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Results: 1333 DeepSpeed issues, sorted by recently updated

I failed to reproduce the example from the [deepspeed tutorials](https://www.deepspeed.ai/tutorials/zero-offload/) with Hugging Face transformers. The main problem is that I need memory of at least 3x the parameter size, and it would be...

**Describe the bug** As the title says: when `overlap_comm` and `contiguous_gradients` are enabled together, `grad_norm` becomes NaN (or a constant float value on the latest master code, with this PR: https://github.com/deepspeedai/DeepSpeed/pull/7171), ...

bug
training
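For reference, a minimal ZeRO stage 2 configuration that combines the two flags from the report above might look like the following. This is a sketch, not the reporter's actual config: the field names follow the DeepSpeed config schema, but batch size and precision settings here are illustrative.

```python
import json

# Minimal ZeRO stage 2 config combining the two reported options.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,         # overlap gradient reduction with backward
        "contiguous_gradients": True  # copy grads into one contiguous buffer
    },
    "fp16": {"enabled": True},
}

print(json.dumps(ds_config, indent=2))
```

Such a dict can be passed to `deepspeed.initialize(..., config=ds_config)` or saved as a JSON file for the `--deepspeed_config` flag.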

**Describe the bug** I am using ds-0.18.0 with the launcher set to openmpi, and I get an error about MPI environment variables. **To Reproduce** Run a command like the following: ``` deepspeed \ --hostfile=${HOSTFILE_PATH}...

bug
training

`ds_secondary_tensor` may be dirty during model loading or ZeRO checkpointing for ZeRO++. 1. Loading the model: my task is transformers SFT. In the transformers code, initialization is done using code...

Compiled Autograd is an extension to torch.compile which enhances the autograd engine by capturing a larger backward computation graph at runtime. This allows a more comprehensive optimization of the backward...

Relaxing the tolerance values to enable the two unit tests below, with FP16 and BF16 data types on ROCm: ``` unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[bf16] unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[fp32] ```

The DeepSpeed optimizer always creates fp32 master params/gradients/optimizer states. However, we sometimes want to keep them in lower precision, given [torch.autocast support](https://deepspeed.readthedocs.io/en/latest/training.html#mixed-precision-training). This PR allows lower-precision master params/grads/optimizer states with bf16/fp16...
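The `torch.autocast` pattern the PR builds on is standard PyTorch: the forward pass runs in a lower precision where safe while parameters stay fp32. A minimal CPU sketch (the model and shapes are illustrative):

```python
import torch

model = torch.nn.Linear(8, 8)   # parameters are fp32
x = torch.randn(4, 8)

# Under autocast, linear layers run in bf16 while the fp32 master
# parameters are untouched; the PR's point is allowing the master
# copies and optimizer states themselves to stay lower precision.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)          # torch.bfloat16
print(model.weight.dtype)  # torch.float32
```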

The patch delivers several fixes for build issues in the CUDA part of the DeepSpeed library. The percentage of passing unit tests improved (tested on RDNA hardware, gfx110x and gfx12x). Before: collected 5298 items...

**Describe the bug** It crashes on the first backward pass during training when I use DeepSpeed ZeRO-2. **Information** Here is the traceback: ``` Training: 0it [00:00, ?it/s] Training: 0%| | 0/40320 [00:00

bug
training

I have observed CPU memory requirements of about `numel * 28 bytes` for ZeRO-2 with optimizer CPU offload. Here is a mapping of the current CPU memory allocations when `offload_optimizer.device: cpu` is...
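As a back-of-envelope check of that figure: taking the reported `numel * 28 bytes` at face value (the 28-byte constant comes from the issue, not from DeepSpeed documentation), the CPU footprint for a few common model sizes is just arithmetic:

```python
# Reported CPU footprint for ZeRO-2 + optimizer CPU offload:
# roughly numel * 28 bytes (figure taken from the issue report).
BYTES_PER_PARAM = 28

def cpu_offload_gib(numel: int) -> float:
    """Estimated CPU memory in GiB for `numel` parameters."""
    return numel * BYTES_PER_PARAM / 2**30

for name, numel in [("1.3B", 1_300_000_000), ("7B", 7_000_000_000)]:
    print(f"{name}: ~{cpu_offload_gib(numel):.1f} GiB")
# 1.3B: ~33.9 GiB
# 7B: ~182.5 GiB
```

So a 7B-parameter model would need roughly 180 GiB of host RAM under this observation, which makes the per-parameter accounting in the issue worth pinning down.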