DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Results 1333 DeepSpeed issues
Sorted by: recently updated

Regarding this issue: https://github.com/pytorch/pytorch/issues/97079 There are some comm ops in DeepSpeed which, for the moment, aren't traceable by Dynamo, and probably the best medium-term solution is to make them...

enhancement

CompiledModuleWrapper is implemented as a wrapper class around the model. I see a few issues when running unit tests with compile enabled. 1. isinstance(self.module, PipelineModule) is used in multiple places in...
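The isinstance problem described above can be reproduced with a minimal sketch (class and attribute names here are illustrative, not DeepSpeed's actual implementation): once the model is wrapped, type checks against the original class fail even though attribute access is transparently delegated.

```python
# Hypothetical minimal reproduction of the wrapper/isinstance conflict.
class PipelineModule:
    pass

class CompiledModuleWrapper:
    """Wraps a module and forwards attribute lookups to it."""
    def __init__(self, module):
        self._module = module

    def __getattr__(self, name):
        # Delegation makes attribute access look transparent,
        # but it does not change the wrapper's type.
        return getattr(self._module, name)

model = PipelineModule()
wrapped = CompiledModuleWrapper(model)

print(isinstance(wrapped, PipelineModule))   # False: the wrapper hides the type

# One common workaround: check the unwrapped inner module instead.
inner = getattr(wrapped, "_module", wrapped)
print(isinstance(inner, PipelineModule))     # True
```

This is why code paths that dispatch on `isinstance(self.module, PipelineModule)` break when compilation wraps the module.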

bug
training

**Describe the bug** * Enable BF16 training * Set gradient accumulation types to FP32 * Enable ZeRO-1, and CPU offload * Enable overlap_comm * Tune train batch size so that...

bug
training

If there are N GPUs, the snapshot will be N files for optimizer states. Each file corresponds to 1 GPU. (let me know if the understanding is not correct). Then,...
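The understanding above matches how ZeRO-style partitioning works: each rank owns one shard of the flat optimizer state and checkpoints only that shard, so a snapshot produces one file per GPU. A hedged sketch of the idea (the sharding helper and file layout here are illustrative, not DeepSpeed's actual code or naming scheme):

```python
# Hypothetical sketch of ZeRO-style optimizer-state partitioning.
def shard(state, world_size, rank):
    """Return the slice of the flat state owned by `rank`."""
    per_rank = (len(state) + world_size - 1) // world_size  # ceil division
    return state[rank * per_rank:(rank + 1) * per_rank]

state = list(range(10))          # stand-in for a flat optimizer state
world_size = 4

# Each rank would save its own shard to its own file, e.g. one file per GPU.
shards = [shard(state, world_size, r) for r in range(world_size)]
print(shards)                    # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

# Concatenating all N per-rank files reconstructs the full state.
merged = [x for s in shards for x in s]
assert merged == state
```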

Is DeepSpeed compatible with AMD CPUs? When I import the DeepSpeedCPUAdam optimizer on an AMD CPU, I get the following warning: [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD...

bug
rocm

**Describe the bug** I reviewed the initialization of self.gradient_accumulation_steps in the DeepSpeedConfig module when only train_batch and micro_batch are set (deepspeed Version: 0.13.1): ```python grad_acc = train_batch // micro_batch grad_acc...
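The floor divisions in the snippet above are where the reported problem lives: a sketch of the derivation (function and variable names are illustrative, not DeepSpeed's exact code) shows how a non-divisible train batch is silently truncated.

```python
# Hypothetical sketch of deriving gradient_accumulation_steps from
# train_batch and micro_batch, as described in the excerpt above.
def derive_grad_acc(train_batch, micro_batch, world_size):
    grad_acc = train_batch // micro_batch   # floor division
    grad_acc //= world_size                 # divide across data-parallel ranks
    return grad_acc

# Exactly divisible case: 32 // 4 = 8, then 8 // 2 = 4 accumulation steps.
print(derive_grad_acc(32, 4, 2))  # 4

# Non-divisible case: 30 // 4 = 7, so only 7 * 4 = 28 of the requested
# 30 samples per step are actually consumed -- the remainder is dropped
# silently rather than raising a configuration error.
print(derive_grad_acc(30, 4, 1))  # 7
```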

bug
training

**Describe the bug** Hi, I'm trying to run GPT model pretraining with the [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed?ysclid=ls8rr5jnv3799144357) pipeline and the ZeRO-3 + MiCS sharding strategy, but got the following log: ``` WARNING: Runtime Error while waiting...

bug
training

Hello, I would like to ask for assistance in solving a problem I've encountered. I am currently training an MLLM with DeepSpeed, and I've introduced an additional modality to the...

PR#5104 (Remove optimizer step on initialization) breaks loading a universal checkpoint for BF16_Optimizer. This is because universal checkpointing attempts to load the optimizer states into the lp._hp_mapping.optim_state dictionary before they are initialized...

The latest actions/checkout uses the latest (non-deprecated) version of Node (16 -> 20). More information [here](https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/): ``` Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3....