mindcv

Training suddenly collapses during model training

Open JingyangXiang opened this issue 2 years ago • 2 comments

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory): The model suddenly collapses partway through training.

  • Hardware Environment (Ascend):

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.10.1):
    -- Python version (e.g., Python 3.7.10):

  • Execute Mode (Mandatory) (Graph):

To Reproduce (Mandatory): The problem occurs intermittently.

Expected behavior (Mandatory): The model trains stably and converges normally.

Screenshots / Logs (Mandatory): 917efa5d6cb67d9db0bbabc89125457

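Guess: the cause is where the train step clips gradients:

    # todo: When to clip grad? Do we need to clip grad after grad reduction? What if grad accumulation is needed?
    if self.clip_grad:
        grads = ops.clip_by_global_norm(grads, clip_norm=self.clip_value)

Gradient clipping should happen after the gradient all_reduce. After I made that change, the collapse no longer occurred. Mathematical reason, comparing the two orderings: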

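  1. Clip first, so that some devices' gradients are clipped and others are not, then run all_reduce (the current behavior).
  2. Take the mean first, then clip the result as a whole.

The current approach may skew the direction of the overall gradient vector, which causes problems when training gradient-sensitive models.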
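The difference between the two orderings can be sketched with a toy example. This is only an illustration in plain Python, not mindcv's train step: `clip_by_global_norm` below is a hand-written stand-in for the MindSpore op, `mean_across_workers` stands in for an averaging all_reduce, and the two per-device gradients and `clip_norm=1.0` are made-up values.

```python
import math


def global_norm(grads):
    """L2 norm of a flat gradient vector."""
    return math.sqrt(sum(g * g for g in grads))


def clip_by_global_norm(grads, clip_norm):
    """Pure-Python stand-in: rescale grads only if their norm exceeds clip_norm."""
    norm = global_norm(grads)
    if norm <= clip_norm:
        return list(grads)
    scale = clip_norm / norm
    return [g * scale for g in grads]


def mean_across_workers(per_worker_grads):
    """Stand-in for an averaging all_reduce over devices."""
    n = len(per_worker_grads)
    return [sum(column) / n for column in zip(*per_worker_grads)]


def cosine(a, b):
    """Cosine similarity between two gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (global_norm(a) * global_norm(b))


# Hypothetical gradients on two devices (assumed values).
worker_grads = [
    [10.0, 0.0],  # large norm: this device's gradient gets clipped
    [0.0, 1.0],   # norm equals clip_norm: left untouched
]
clip_norm = 1.0

true_mean = mean_across_workers(worker_grads)

# Ordering 1: clip on each device, then average.
clip_then_reduce = mean_across_workers(
    [clip_by_global_norm(g, clip_norm) for g in worker_grads]
)

# Ordering 2: average first, then clip the averaged gradient.
reduce_then_clip = clip_by_global_norm(true_mean, clip_norm)

print("clip then reduce:", clip_then_reduce, cosine(clip_then_reduce, true_mean))
print("reduce then clip:", reduce_then_clip, cosine(reduce_then_clip, true_mean))
```

In this toy case, clipping after the reduction only rescales the averaged gradient, so its direction is unchanged. Clipping per device shrinks one device's contribution but not the other's, so the averaged result points in a different direction from the true mean.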

JingyangXiang, Aug 15 '23 04:08

Duplicate of #603

geniuspatrick, Sep 20 '23 02:09

We are working on a new train step. Note, however, that the custom train step is an experimental feature and may undergo incompatible changes in the future.

geniuspatrick, Sep 20 '23 02:09