
[BUG] Gradient clipping not working

Open jiangix-paper opened this issue 8 months ago • 12 comments

Describe the bug Hello, I used accelerate + DeepSpeed ZeRO-3 for distributed GRPO training on 8 A800 GPUs. For gradient clipping, I set max_grad_norm=1.0 in the training arguments and gradient_clipping=1.0 in deepspeed3.yaml. During training, many of the printed grad_norm values are greater than 1.0. It seems that these parameters have no effect.

Expected behavior I think the grad_norm values should be at most max_grad_norm (or gradient_clipping).

System info (please complete the following information): trl 0.16.0, deepspeed 0.15.4, accelerate 0.34.0, pytorch 2.5.1, transformers 4.49.0

jiangix-paper avatar Apr 17 '25 01:04 jiangix-paper

I am having the same issue with accelerate+deepspeed zero2.

My config is: pytorch 2.5.1 accelerate 1.4.0 transformers 4.49.0 deepspeed 0.15.4 (also tested with 0.16.6, but the issue persists)

I am running on 4 A100 GPUs. When I try a single GPU without deepspeed, gradient clipping works correctly.

This is my zero2 config:

debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

I tried adding gradient_clipping: 1.0 (in addition to max_grad_norm in transformers), but it has no effect.
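For reference, I placed it under deepspeed_config, like this (a fragment of the same accelerate yaml, shown only to illustrate where the key goes):

deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2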

luckeciano avatar Apr 17 '25 17:04 luckeciano

Ok, after some investigation I found out that, at least in my case, gradient clipping is happening, but the transformers Trainer is logging the wrong value.

In transformers' trainer, the grad norm is logged by calling get_global_grad_norm():

if (is_accelerate_available()
        and self.accelerator.distributed_type == DistributedType.DEEPSPEED):
    grad_norm = model.get_global_grad_norm()
    # In some cases the grad norm may not return a float
    if hasattr(grad_norm, "item"):
        grad_norm = grad_norm.item()
else:
    grad_norm = _grad_norm

get_global_grad_norm() is a function that just returns _global_grad_norm from the engine.

This variable is set in the engine, in the following function:

def _take_model_step(self, lr_kwargs, block_eigenvalue={}):
    if self.gradient_clipping() > 0.0:
        if not (self.fp16_enabled() or self.bfloat16_enabled() or self.amp_enabled() or self.zero_optimization()):
            self.clip_fp32_gradients()
        elif self.amp_enabled():
            # AMP's recommended way of doing clipping
            # https://nvidia.github.io/apex/advanced.html#gradient-clipping
            master_params = amp.master_params(self.optimizer)
            clip_grad_norm_(parameters=master_params, max_norm=self.gradient_clipping(), mpu=self.mpu)
    self.optimizer.step()

    if hasattr(self.optimizer, '_global_grad_norm'):
        self._global_grad_norm = self.optimizer._global_grad_norm

...

However, the value of _global_grad_norm is the value before gradient clipping. So even though the clipping operation is performed, the Trainer retrieves and logs the pre-clip norm. I believe DeepSpeed should update this value after clipping, otherwise it can cause a lot of confusion in downstream applications.
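This matches PyTorch's own convention: clip_grad_norm_ also returns the pre-clip norm, while the gradients themselves end up with norm at most max_norm. A minimal sketch (plain PyTorch, independent of DeepSpeed):

import torch

# Toy parameter with a deliberately large gradient.
p = torch.nn.Parameter(torch.ones(10))
p.grad = torch.full((10,), 5.0)  # global grad norm ≈ 15.8

returned = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)

print("returned (pre-clip) norm:", returned.item())      # ≈ 15.81
print("actual post-clip norm:   ", p.grad.norm().item())  # ≈ 1.00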

luckeciano avatar Apr 17 '25 18:04 luckeciano

Hello, thanks for your reply.

In fact, here is a picture of my GRPO training log. I have some questions about it:

Image

Obviously, there are three points in the figure where grad_norm is greater than 1. If max_grad_norm is set to 1, then even if the logged grad_norm is the pre-clip value, the loss should not fluctuate this much after clipping. In fact, the loss and grad_norm in the figure fluctuate almost in sync. If clipping were actually applied, shouldn't the loss stay stable?

Also, besides looking at the printed log, how can I make sure that gradient clipping is actually working?

jiangix-paper avatar Apr 18 '25 01:04 jiangix-paper

Have you solved this problem? I'm facing the same issue o(╥﹏╥)o

SeuZL avatar May 15 '25 08:05 SeuZL

Same problem using bf16 + ZeRO-3. It seems DeepSpeed skips gradient clipping here: https://github.com/deepspeedai/DeepSpeed/blob/41fceadeeb41c1a95e2b3aeef4d04077a5902b20/deepspeed/runtime/engine.py#L2272

def _take_model_step(self, lr_kwargs, block_eigenvalue={}):
    if self.gradient_clipping() > 0.0:
        if not (self.fp16_enabled() or self.bfloat16_enabled() or self.amp_enabled() or self.zero_optimization()):
            self.clip_fp32_gradients()
        elif self.amp_enabled():
            # AMP's recommended way of doing clipping
            # https://nvidia.github.io/apex/advanced.html#gradient-clipping
            master_params = amp.master_params(self.optimizer)
            clip_grad_norm_(parameters=master_params, max_norm=self.gradient_clipping(), mpu=self.mpu)
    self.optimizer.step()

Any solution?

davidluciolu avatar May 21 '25 07:05 davidluciolu

Sorry for the mistake. This issue has pointed out that gradient clipping happens inside optimizer.step() in DeepSpeed.
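For reference, the clip-by-global-norm scaling that optimizers typically apply inside step() looks roughly like this (a conceptual sketch, not DeepSpeed's actual implementation):

import torch

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    # Scale all gradients in place so their global L2 norm is at most max_norm.
    total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1.0:  # only scale when the norm exceeds the threshold
        for g in grads:
            g.mul_(clip_coef)
    return total_norm  # the *pre-clip* norm is what ends up being reported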

davidluciolu avatar May 21 '25 09:05 davidluciolu

Sorry for the mistake. This issue has pointed out that gradient clipping happens inside optimizer.step() in DeepSpeed.

Hello, I have also looked into this issue, and I am not very familiar with these low-level details. However, based on my training logs, I can confirm that gradient clipping has not been performed. Do you know the reason for this?

SeuZL avatar May 21 '25 09:05 SeuZL

I also notice that the results I get are not right: the logged gradient norm can still be larger than the gradient_clipping value I set.

tingxueronghua avatar Jun 05 '25 08:06 tingxueronghua

Ok, after some investigation I found out that, at least in my case, gradient clipping is happening, but the transformers Trainer is logging the wrong value.

In transformers' trainer, the grad norm is logged by calling get_global_grad_norm():

if (is_accelerate_available()
        and self.accelerator.distributed_type == DistributedType.DEEPSPEED):
    grad_norm = model.get_global_grad_norm()
    # In some cases the grad norm may not return a float
    if hasattr(grad_norm, "item"):
        grad_norm = grad_norm.item()
else:
    grad_norm = _grad_norm

get_global_grad_norm() is a function that just returns _global_grad_norm from the engine.

This variable is set in the engine, in the following function:

def _take_model_step(self, lr_kwargs, block_eigenvalue={}):
    if self.gradient_clipping() > 0.0:
        if not (self.fp16_enabled() or self.bfloat16_enabled() or self.amp_enabled() or self.zero_optimization()):
            self.clip_fp32_gradients()
        elif self.amp_enabled():
            # AMP's recommended way of doing clipping
            # https://nvidia.github.io/apex/advanced.html#gradient-clipping
            master_params = amp.master_params(self.optimizer)
            clip_grad_norm_(parameters=master_params, max_norm=self.gradient_clipping(), mpu=self.mpu)
    self.optimizer.step()

    if hasattr(self.optimizer, '_global_grad_norm'):
        self._global_grad_norm = self.optimizer._global_grad_norm

...

However, the value of _global_grad_norm is the value before gradient clipping. So even though the clipping operation is performed, the Trainer retrieves and logs the pre-clip norm. I believe DeepSpeed should update this value after clipping, otherwise it can cause a lot of confusion in downstream applications.

Correct analysis. Maybe you can recalculate the value of _global_grad_norm yourself after clipping; the transformers library will then read the new value and display the correct gradient norm in the log.
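For example, a hypothetical monkey-patch along these lines could cap the reported value after the step. It assumes that clip-by-global-norm makes the effective post-clip norm min(pre-clip norm, clip); this is only a sketch, not tested against every ZeRO stage:

from deepspeed.runtime.engine import DeepSpeedEngine

_orig_take_model_step = DeepSpeedEngine._take_model_step

def _take_model_step_patched(self, *args, **kwargs):
    _orig_take_model_step(self, *args, **kwargs)
    clip = self.gradient_clipping()
    norm = getattr(self, "_global_grad_norm", None)
    if clip > 0.0 and norm is not None:
        # Clip-by-global-norm caps the effective norm at `clip`,
        # so report min(pre-clip norm, clip) instead of the raw value.
        self._global_grad_norm = min(float(norm), clip)

DeepSpeedEngine._take_model_step = _take_model_step_patched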

Parsifal133 avatar Jun 13 '25 03:06 Parsifal133

(Quoting the analysis and suggested fix above:) Gradient clipping is happening, but the transformers Trainer logs the pre-clip _global_grad_norm that it reads from the DeepSpeed engine via get_global_grad_norm(); recalculating _global_grad_norm after clipping would make the Trainer display the correct gradient norm in the log.

What a great job, thank you so much. Then I can use it with confidence

SeuZL avatar Jun 13 '25 03:06 SeuZL

Also, I found that if DeepSpeed is enabled, the max_grad_norm set in the transformers Trainer is not used... hard to tell 😭 Image
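To see which value DeepSpeed actually ended up with, one option is to print the engine's own setting after the Trainer has wrapped the model (a hypothetical snippet; it assumes trainer.model_wrapped is the DeepSpeedEngine at that point):

# Hypothetical check, run after training setup (e.g. from a callback):
# assumes trainer.model_wrapped is the DeepSpeedEngine.
engine = trainer.model_wrapped
print("engine gradient_clipping:", engine.gradient_clipping())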

zyandtom avatar Aug 20 '25 11:08 zyandtom

Ok, after some investigation I found out that, at least in my case, gradient clipping is happening, but the transformers Trainer is logging the wrong value. In transformers' trainer, the grad norm is logged by calling get_global_grad_norm():

if (is_accelerate_available()
        and self.accelerator.distributed_type == DistributedType.DEEPSPEED):
    grad_norm = model.get_global_grad_norm()
    # In some cases the grad norm may not return a float
    if hasattr(grad_norm, "item"):
        grad_norm = grad_norm.item()
else:
    grad_norm = _grad_norm

get_global_grad_norm() is a function that just returns _global_grad_norm from the engine. This variable is set in the engine, in the following function:

def _take_model_step(self, lr_kwargs, block_eigenvalue={}):
    if self.gradient_clipping() > 0.0:
        if not (self.fp16_enabled() or self.bfloat16_enabled() or self.amp_enabled() or self.zero_optimization()):
            self.clip_fp32_gradients()
        elif self.amp_enabled():
            # AMP's recommended way of doing clipping
            # https://nvidia.github.io/apex/advanced.html#gradient-clipping
            master_params = amp.master_params(self.optimizer)
            clip_grad_norm_(parameters=master_params, max_norm=self.gradient_clipping(), mpu=self.mpu)
    self.optimizer.step()

    if hasattr(self.optimizer, '_global_grad_norm'):
        self._global_grad_norm = self.optimizer._global_grad_norm

...

However, the value of _global_grad_norm is the value before gradient clipping. So even though the clipping operation is performed, the Trainer retrieves and logs the pre-clip norm. I believe DeepSpeed should update this value after clipping, otherwise it can cause a lot of confusion in downstream applications.

Correct analysis. Maybe you can recalculate the value of _global_grad_norm yourself after clipping; the transformers library will then read the new value and display the correct gradient norm in the log.

We have created a new repository that implements printing the post-clip gradient norm; see: https://github.com/SinovatioAI/llamafactory_deepspeed_clipped_grad_inspect

gysabc avatar Nov 06 '25 06:11 gysabc