DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Results 1333 DeepSpeed issues
Sorted by: recently updated

Regarding this issue: https://github.com/pytorch/pytorch/issues/97079 There are some comm ops in DeepSpeed which, for the moment, aren't traceable by Dynamo, and probably the best medium-term solution is to make them...

enhancement

CompiledModuleWrapper is implemented as a wrapper class around the model. I see a few issues when running unit tests with compile enabled. 1. isinstance(self.module, PipelineModule) is used in multiple places in...
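The isinstance problem described above can be reproduced with a minimal sketch (class and attribute names here are illustrative, not DeepSpeed's actual implementation): once the model is wrapped, type checks against the original class fail even though attribute access is transparently delegated.

```python
# Hypothetical minimal reproduction of the wrapper/isinstance conflict.
class PipelineModule:
    pass

class CompiledModuleWrapper:
    """Wraps a module and forwards attribute lookups to it."""
    def __init__(self, module):
        self._module = module

    def __getattr__(self, name):
        # Delegation makes attribute access look transparent,
        # but it does not change the wrapper's type.
        return getattr(self._module, name)

model = PipelineModule()
wrapped = CompiledModuleWrapper(model)

print(isinstance(wrapped, PipelineModule))   # False: the wrapper hides the type

# One common workaround: check the unwrapped inner module instead.
inner = getattr(wrapped, "_module", wrapped)
print(isinstance(inner, PipelineModule))     # True
```

This is why code paths that dispatch on `isinstance(self.module, PipelineModule)` break when compilation wraps the module.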

bug
training

**Describe the bug** * Enable BF16 training * Set gradient accumulation types to FP32 * Enable ZeRO-1, and CPU offload * Enable overlap_comm * Tune train batch size so that...

bug
training

If there are N GPUs, the snapshot will be N files for optimizer states. Each file corresponds to 1 GPU. (let me know if the understanding is not correct). Then,...
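The understanding above matches how ZeRO-style partitioning works: each rank owns one shard of the flat optimizer state and checkpoints only that shard, so a snapshot produces one file per GPU. A hedged sketch of the idea (the sharding helper and file layout here are illustrative, not DeepSpeed's actual code or naming scheme):

```python
# Hypothetical sketch of ZeRO-style optimizer-state partitioning.
def shard(state, world_size, rank):
    """Return the slice of the flat state owned by `rank`."""
    per_rank = (len(state) + world_size - 1) // world_size  # ceil division
    return state[rank * per_rank:(rank + 1) * per_rank]

state = list(range(10))          # stand-in for a flat optimizer state
world_size = 4

# Each rank would save its own shard to its own file, e.g. one file per GPU.
shards = [shard(state, world_size, r) for r in range(world_size)]
print(shards)                    # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

# Concatenating all N per-rank files reconstructs the full state.
merged = [x for s in shards for x in s]
assert merged == state
```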

Is DeepSpeed compatible with AMD CPUs? When I import the DeepSpeedCPUAdam optimizer on an AMD CPU, I get the following warning: [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD...

bug
rocm

**Describe the bug** I reviewed the initialization of self.gradient_accumulation_steps in the DeepSpeedConfig module when only train_batch and micro_batch are set (deepspeed Version: 0.13.1): ```python grad_acc = train_batch // micro_batch grad_acc...
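The floor divisions in the snippet above are where the reported problem lives: a sketch of the derivation (function and variable names are illustrative, not DeepSpeed's exact code) shows how a non-divisible train batch is silently truncated.

```python
# Hypothetical sketch of deriving gradient_accumulation_steps from
# train_batch and micro_batch, as described in the excerpt above.
def derive_grad_acc(train_batch, micro_batch, world_size):
    grad_acc = train_batch // micro_batch   # floor division
    grad_acc //= world_size                 # divide across data-parallel ranks
    return grad_acc

# Exactly divisible case: 32 // 4 = 8, then 8 // 2 = 4 accumulation steps.
print(derive_grad_acc(32, 4, 2))  # 4

# Non-divisible case: 30 // 4 = 7, so only 7 * 4 = 28 of the requested
# 30 samples per step are actually consumed -- the remainder is dropped
# silently rather than raising a configuration error.
print(derive_grad_acc(30, 4, 1))  # 7
```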

bug
training

**Describe the bug** Hi, I'm trying to run GPT model pretraining with the [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed?ysclid=ls8rr5jnv3799144357) pipeline and the ZeRO-3 + MiCS sharding strategy, but got the following log: ``` WARNING: Runtime Error while waiting...

bug
training

Hello, I would like to ask for assistance in solving a problem I've encountered. I am currently training an MLLM with DeepSpeed, and I've introduced an additional modality to the...

PR#5104 (Remove optimizer step on initialization) breaks loading a universal checkpoint for BF16_Optimizer. This is because universal checkpointing attempts to load the optimizer states into the lp._hp_mapping.optim_state dictionary before they are initialized...

The latest actions/checkout uses the latest (non-deprecated) version of Node (16 -> 20). More information [here](https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/): ``` Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3....