Training suddenly crashes during model training
If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md
Describe the bug / 问题描述 (Mandatory / 必填)
The model crashes suddenly during training.
-
Hardware Environment (Ascend) / 硬件环境: -
Software Environment / 软件环境 (Mandatory / 必填):
- MindSpore version (e.g., 1.10.1): -
- Python version (e.g., Python 3.7.10): -
-
Execute Mode / 执行模式 (Mandatory / 必填) (Graph):
To Reproduce / 重现步骤 (Mandatory / 必填)
Occurs intermittently.
Expected behavior / 预期结果 (Mandatory / 必填)
The model trains stably and converges normally.
Screenshots / 日志 / 截图 (Mandatory / 必填)
Hypothesis: the problem is in this part of the train step:

```python
# todo: When to clip grad? Do we need to clip grad after grad reduction?
#       What if grad accumulation is needed?
if self.clip_grad:
    grads = ops.clip_by_global_norm(grads, clip_norm=self.clip_value)
```

Gradient clipping should be applied after the gradient all-reduce; after moving it, the crash no longer appears.

Mathematical reason:
- Current order: each worker clips its own gradients first (so some are clipped and some are not), and only then is the all-reduce taken.
- Proposed order: take the mean across workers first, then clip the aggregated gradient.

With the current scheme, the direction of the overall gradient vector can be biased, which causes training problems for models that are sensitive to the gradient.
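The direction bias described above can be shown with a minimal plain-Python sketch (hypothetical two-worker gradients and clip norm; `clip_by_norm` and `mean` below stand in for `ops.clip_by_global_norm` and the all-reduce average):

```python
import math

def clip_by_norm(grad, clip_norm):
    """Scale grad down so its L2 norm is at most clip_norm (global-norm clipping)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    scale = clip_norm / norm
    return [g * scale for g in grad]

def mean(grads):
    """Element-wise mean across workers, i.e. what the all-reduce computes."""
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]

# Two workers: one gradient exceeds the clip norm, the other does not.
g_worker0 = [6.0, 0.0]   # norm 6 -> will be clipped at clip_norm=1
g_worker1 = [0.0, 0.5]   # norm 0.5 -> left untouched
clip_norm = 1.0

# Current order: clip per worker, then average (all-reduce).
clip_then_mean = mean([clip_by_norm(g, clip_norm)
                       for g in (g_worker0, g_worker1)])

# Proposed order: average first, then clip the aggregated gradient.
mean_then_clip = clip_by_norm(mean([g_worker0, g_worker1]), clip_norm)

print(clip_then_mean)  # ≈ [0.5, 0.25]
print(mean_then_clip)  # roughly [0.997, 0.083] -- a visibly different direction
```

The two results point in different directions: clipping per worker re-weights the workers relative to each other before averaging, whereas clipping the averaged gradient only rescales it and preserves its direction.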
Duplicate of #603.
We are working on a new train step. However, it should be noted that the custom train step is an experimental feature and may undergo incompatible changes in the future.