
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Results 1333 DeepSpeed issues

**Describe the bug** I am using ZeRO-3 to train a model; memory consumption is higher than expected, so I dumped a torch memory trace (below). During fwd and bwd,...

bug
training
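The issue above mentions dumping a torch memory trace. A minimal sketch of how such a trace can be captured with PyTorch's CUDA allocator-history API (the helper name `record_memory_trace` is hypothetical; the `torch.cuda.memory._record_memory_history` / `_dump_snapshot` calls are real but private APIs in recent PyTorch releases, and the context manager is a no-op on CPU-only machines):

```python
import contextlib
import torch

@contextlib.contextmanager
def record_memory_trace(path="memory_snapshot.pickle"):
    # Hypothetical helper: record CUDA allocator history around a
    # training step and dump a snapshot viewable at pytorch.org/memory_viz.
    if not torch.cuda.is_available():
        yield  # no-op without CUDA
        return
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    try:
        yield
    finally:
        torch.cuda.memory._dump_snapshot(path)
        torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```

A fwd/bwd step wrapped in `with record_memory_trace("step.pickle"):` would then produce a snapshot that attributes allocations to Python stack traces, which is the kind of trace the report refers to.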

After this commit (https://github.com/deepspeedai/DeepSpeed/pull/4906), secondary partitioned tensors are updated only after `optimizer.step()`. However, secondary partitioned tensors should also be updated when loading a `state_dict` or resizing embeddings after init, e.g., https://github.com/huggingface/transformers/blob/1c4b62b219323a31011bac3bd3cece7675d9e4c3/src/transformers/integrations/deepspeed.py#L344

Hello, I want to perform inference on the HuggingFace MoE model [Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) with expert parallelism using DeepSpeed in a multi-GPU environment. However, the official tutorials are not comprehensive enough, and...

**Describe the bug** Our model has a small parameter with shape `torch.Size([32])`. When ZeRO++ is enabled, it raises the following error: - world_size: 2048 - zero_hpz_partition_size: 16 ``` File "/usr/local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344,...

bug
training

**Is your feature request related to a problem? Please describe.** In some learning problems, the correct allreduce of gradients across data-parallel workers is SUM rather than MEAN. For example, when...

enhancement
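The SUM-vs-MEAN distinction in the request above can be illustrated with plain `torch.distributed`. A hedged sketch using a single-process gloo group (the group setup is only for demonstration; in real data-parallel training the division by world size is what turns a SUM allreduce into the MEAN behavior the issue wants to opt out of):

```python
import os
import torch
import torch.distributed as dist

# Single-process gloo group, just to make the collective runnable anywhere.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

grad = torch.tensor([2.0, 4.0])
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # SUM of gradients across workers
# Averaging (the default gradient semantics) additionally divides by world size:
grad_mean = grad / dist.get_world_size()

dist.destroy_process_group()
```

With a world size of 1 the two results coincide; with N workers, the feature request amounts to skipping the final division.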

**Describe the bug** In some use cases, we have to delete the training engine after training and load it again after some operations. What is the correct way to delete...

bug
training
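Since the question above is precisely that no single documented teardown call is given, the following is only a hedged sketch of a common cleanup pattern (the function name `release_engine` is hypothetical; it drops references and releases cached GPU memory, and calls `destroy()` only if the installed DeepSpeed version provides it):

```python
import gc
import torch

def release_engine(engine):
    # Hypothetical teardown sketch, not an official DeepSpeed API:
    # remove hooks if a destroy() method exists, drop the reference,
    # then free cached CUDA memory so a new engine can be created.
    if hasattr(engine, "destroy"):
        engine.destroy()
    del engine
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

Whether this fully releases ZeRO partitions and communication groups is exactly what the issue asks the maintainers to clarify.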

The optimizer has been re-implemented to group parameters and set different learning rates for each group. However, after using DeepSpeed, all the `param_groups` are merged into one. How can this...
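For reference, this is what the multiple-group setup looks like in plain PyTorch before the optimizer is handed to `deepspeed.initialize` (a minimal sketch; the model and learning rates are illustrative, and whether DeepSpeed preserves these groups is the subject of the question above):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 2))

# Two parameter groups with different learning rates, standard PyTorch:
optimizer = torch.optim.AdamW([
    {"params": model[0].parameters(), "lr": 1e-3},
    {"params": model[1].parameters(), "lr": 1e-4},
])

lrs = [group["lr"] for group in optimizer.param_groups]  # one lr per group
```

In plain PyTorch `optimizer.param_groups` keeps both entries; the report is that after wrapping with DeepSpeed they collapse into one.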

**Describe the bug** During evaluation of a [Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501) I have code resembling the following: ```python from contextlib import contextmanager import torch.distributed as dist from deepspeed import DeepSpeedEngine from transformers import...

bug
inference
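The code in the report above is truncated after its imports. A self-contained sketch of the kind of evaluation context manager the imports suggest (the name `evaluating` and its body are hypothetical reconstructions, not the reporter's actual code): switch the model to eval mode and disable grad, restoring the previous training mode afterwards.

```python
from contextlib import contextmanager
import torch

@contextmanager
def evaluating(model):
    # Hypothetical sketch: eval mode + no_grad for the duration,
    # then restore whatever training state the model was in before.
    was_training = model.training
    model.eval()
    try:
        with torch.no_grad():
            yield model
    finally:
        model.train(was_training)
```

Usage: `with evaluating(model) as m: out = m(batch)` leaves `model` back in training mode on exit.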

**Is your feature request related to a problem? Please describe.** My hardware platform, the Jetson AGX Orin, does not support the NCCL libraries, so I cannot compile DeepSpeed on...

enhancement

This is a living document! For each item here, we intend to link the PR/issue for discussion. This is DeepSpeed's first attempt at a public roadmap and will be updated...

roadmap