DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
**Describe the bug** I am using ZeRO-3 to train a model; memory consumption is higher than expected, so I dumped a torch memory trace (below). During fwd and bwd,...
After this commit (https://github.com/deepspeedai/DeepSpeed/pull/4906), secondary partitioned tensors are updated only after optimizer.step(). When loading a state_dict or resizing embeddings after init, the secondary partitioned tensors should also be updated, e.g., https://github.com/huggingface/transformers/blob/1c4b62b219323a31011bac3bd3cece7675d9e4c3/src/transformers/integrations/deepspeed.py#L344
Hello, I want to perform inference on the HuggingFace MoE model [Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) with expert parallelism using DeepSpeed in a multi-GPU environment. However, the official tutorials are not comprehensive enough, and...
**Describe the bug** Our model has a small parameter with shape `torch.Size([32])`. When ZeRO++ is enabled, it raises the following error: - world_size: 2048 - zero_hpz_partition_size: 16 ``` File "/usr/local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1344,...
**Is your feature request related to a problem? Please describe.** In some learning problems, the correct allreduce of gradients across data-parallel workers is SUM rather than MEAN. For example, when...
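The request above can be illustrated with a small plain-Python simulation (no `torch.distributed`; the `allreduce` helper and the worker data below are hypothetical, not DeepSpeed API). When workers hold unequal numbers of samples, only a SUM reduction of locally accumulated gradients recovers the true global gradient; MEAN is correct only for equal batch sizes:

```python
# Illustrative sketch: why the allreduce op matters for gradient reduction.
# All names here are hypothetical, not part of DeepSpeed or torch.distributed.

def allreduce(grads, op):
    """Simulate an allreduce over one scalar gradient per worker."""
    total = sum(grads)
    return total if op == "sum" else total / len(grads)

# Each worker accumulates per-sample gradients locally (unequal batches).
worker_batches = [[1.0, 3.0], [2.0], [4.0, 6.0, 8.0]]
local_grads = [sum(b) for b in worker_batches]          # [4.0, 2.0, 18.0]

# The true global gradient is the sum over every sample on every worker.
true_sum = sum(g for b in worker_batches for g in b)    # 24.0

print(allreduce(local_grads, "sum"))   # 24.0 — matches the global sum
print(allreduce(local_grads, "mean"))  # 8.0  — wrong here; assumes equal batches
```

With equal per-worker batch sizes the two ops differ only by a constant factor absorbed into the learning rate, which is why MEAN is the usual default; the divergence appears exactly in the unequal-batch settings the request describes.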
**Describe the bug** In some use cases, we have to delete the training engine after training and load it again after some operations. What is the correct way to delete...
The optimizer has been re-implemented to group parameters and set a different learning rate for each group. However, after using DeepSpeed, all the `param_groups` are merged into one. How can this...
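For context, the behaviour the issue wants to preserve can be sketched in plain Python (`TinySGD` is a hypothetical toy, not torch or DeepSpeed code): per-group learning rates live in the same list-of-dicts shape that torch optimizers expose as `param_groups`, and merging the groups would discard the per-group `lr` values.

```python
# Minimal sketch of per-group learning rates, mirroring the torch
# `param_groups` list-of-dicts convention. Hypothetical names throughout.

class TinySGD:
    def __init__(self, param_groups):
        # e.g. [{"params": [...], "lr": 0.1}, {"params": [...], "lr": 0.01}]
        self.param_groups = param_groups

    def step(self, grads_by_group):
        # Each group is updated with its own learning rate.
        for group, grads in zip(self.param_groups, grads_by_group):
            lr = group["lr"]
            group["params"] = [p - lr * g for p, g in zip(group["params"], grads)]

opt = TinySGD([
    {"params": [1.0], "lr": 0.1},   # e.g. backbone weights
    {"params": [1.0], "lr": 0.01},  # e.g. a freshly initialised head
])
opt.step([[1.0], [1.0]])
print(opt.param_groups[0]["params"])  # ≈ [0.9]
print(opt.param_groups[1]["params"])  # ≈ [0.99]
```

If the two groups were merged into one, a single `lr` would apply to all parameters and the distinction above would be lost, which is the behaviour the issue reports after wrapping the optimizer with DeepSpeed.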
**Describe the bug** During evaluation of a [Mistral-Small-24B-Instruct-2501](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501) I have code resembling the following: ```python from contextlib import contextmanager import torch.distributed as dist from deepspeed import DeepSpeedEngine from transformers import...
**Is your feature request related to a problem? Please describe.** My hardware platform, the Jetson AGX Orin, does not support the NCCL libraries, so I cannot compile DeepSpeed on...
This is a living document! For each item here, we intend to link the PR/issue for discussion. This is DeepSpeed's first attempt at a public roadmap and will be updated...