DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
**Describe the bug** I am using ZeRO-3 to train a model; memory consumption is higher than expected, so I dumped a torch memory trace (below). During fwd and bwd,...
After this commit (https://github.com/deepspeedai/DeepSpeed/pull/4906), secondary partitioned tensors are updated only after optimizer.step(). When loading a state_dict or resizing embeddings after init, the secondary partitioned tensors should also be updated, e.g., https://github.com/huggingface/transformers/blob/1c4b62b219323a31011bac3bd3cece7675d9e4c3/src/transformers/integrations/deepspeed.py#L344
Hello, I want to perform inference on the HuggingFace MoE model [Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) with expert parallelism using DeepSpeed in a multi-GPU environment. However, the official tutorials are not comprehensive enough, and...
**Describe the bug** Our model has a small parameter with shape `torch.Size([32])`. When ZeRO++ is enabled, it raises the following error: - world_size: 2048 - zero_hpz_partition_size: 16 ``` File "/usr/local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344,...
**Is your feature request related to a problem? Please describe.** In some learning problems, the correct allreduce of gradients across data-parallel workers is SUM rather than MEAN. For example, when...
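The request above can be illustrated with a small plain-Python simulation (no `torch.distributed`; the `allreduce` helper and the worker data below are hypothetical, not DeepSpeed API). When workers hold unequal numbers of samples, only a SUM reduction of locally accumulated gradients recovers the true global gradient; MEAN is correct only for equal batch sizes:

```python
# Illustrative sketch: why the allreduce op matters for gradient reduction.
# All names here are hypothetical, not part of DeepSpeed or torch.distributed.

def allreduce(grads, op):
    """Simulate an allreduce over one scalar gradient per worker."""
    total = sum(grads)
    return total if op == "sum" else total / len(grads)

# Each worker accumulates per-sample gradients locally (unequal batches).
worker_batches = [[1.0, 3.0], [2.0], [4.0, 6.0, 8.0]]
local_grads = [sum(b) for b in worker_batches]          # [4.0, 2.0, 18.0]

# The true global gradient is the sum over every sample on every worker.
true_sum = sum(g for b in worker_batches for g in b)    # 24.0

print(allreduce(local_grads, "sum"))   # 24.0 — matches the global sum
print(allreduce(local_grads, "mean"))  # 8.0  — wrong here; assumes equal batches
```

With equal per-worker batch sizes the two ops differ only by a constant factor absorbed into the learning rate, which is why MEAN is the usual default; the divergence appears exactly in the unequal-batch settings the request describes.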
**Describe the bug** In some use cases, we have to delete the training engine after training and load it again after some operations. What is the correct way to delete...
The optimizer has been re-implemented to group parameters and set a different learning rate for each group. However, after using DeepSpeed, all the `param_groups` are merged into one. How can this...
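For context, the behaviour the issue wants to preserve can be sketched in plain Python (`TinySGD` is a hypothetical toy, not torch or DeepSpeed code): per-group learning rates live in the same list-of-dicts shape that torch optimizers expose as `param_groups`, and merging the groups would discard the per-group `lr` values.

```python
# Minimal sketch of per-group learning rates, mirroring the torch
# `param_groups` list-of-dicts convention. Hypothetical names throughout.

class TinySGD:
    def __init__(self, param_groups):
        # e.g. [{"params": [...], "lr": 0.1}, {"params": [...], "lr": 0.01}]
        self.param_groups = param_groups

    def step(self, grads_by_group):
        # Each group is updated with its own learning rate.
        for group, grads in zip(self.param_groups, grads_by_group):
            lr = group["lr"]
            group["params"] = [p - lr * g for p, g in zip(group["params"], grads)]

opt = TinySGD([
    {"params": [1.0], "lr": 0.1},   # e.g. backbone weights
    {"params": [1.0], "lr": 0.01},  # e.g. a freshly initialised head
])
opt.step([[1.0], [1.0]])
print(opt.param_groups[0]["params"])  # ≈ [0.9]
print(opt.param_groups[1]["params"])  # ≈ [0.99]
```

If the two groups were merged into one, a single `lr` would apply to all parameters and the distinction above would be lost, which is the behaviour the issue reports after wrapping the optimizer with DeepSpeed.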
**Describe the bug** During evaluation of a [Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501) I have code resembling the following: ```python from contextlib import contextmanager import torch.distributed as dist from deepspeed import DeepSpeedEngine from transformers import...
**Is your feature request related to a problem? Please describe.** My hardware platform, the Jetson AGX Orin, does not support the NCCL libraries, so I cannot compile DeepSpeed on...
This is a living document! For each item here, we intend to link the PR/issue for discussion. This is DeepSpeed's first attempt at a public roadmap and will be updated...