mindcv: training suddenly collapses partway through model training

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory): The model suddenly collapses during training.

Hardware Environment (Ascend):
Software Environment (Mandatory):
-- MindSpore version (e.g., 1.10.1) :
-- Python version (e.g., Python 3.7.10) :
Execute Mode (Mandatory) (Graph):

To Reproduce (Mandatory): Occurs sporadically.

Expected behavior (Mandatory): The model trains stably and converges normally.

Screenshots / Logs (Mandatory): [image: 917efa5d6cb67d9db0bbabc89125457]

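My guess is that the cause is the following code in the train step:

    # todo: When to clip grad? Do we need to clip grad after grad reduction? What if grad accumulation is needed?
    if self.clip_grad:
        grads = ops.clip_by_global_norm(grads, clip_norm=self.clip_value)

Gradient clipping should be applied after the all_reduce of the gradients. After I made this change, the collapse no longer occurred. The mathematical reasoning, comparing the two orderings: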

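Clip first, then all_reduce (current behavior): some devices clip their gradients and others do not.
Take the mean first, then clip the averaged gradient as a whole.

With the current scheme, the direction of the aggregated gradient vector may be biased, which can cause training problems for models that are sensitive to gradients.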
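
To make the direction argument concrete, here is a minimal pure-Python sketch. It is not MindSpore or mindcv code: the helper `clip_by_global_norm` below is a hand-written stand-in for the library op, `mean_grads` stands in for an averaging all_reduce, and the two-device gradients and `clip_norm` value are made-up numbers chosen so that only one device exceeds the threshold.

```python
import math

def global_norm(grad):
    """L2 norm of a flat gradient vector."""
    return math.sqrt(sum(g * g for g in grad))

def clip_by_global_norm(grad, clip_norm):
    """Stand-in for global-norm clipping: rescale only if the norm exceeds clip_norm."""
    norm = global_norm(grad)
    if norm <= clip_norm:
        return list(grad)
    scale = clip_norm / norm
    return [g * scale for g in grad]

def mean_grads(per_device):
    """Stand-in for an averaging all_reduce across devices."""
    n = len(per_device)
    return [sum(vals) / n for vals in zip(*per_device)]

def cosine(a, b):
    """Cosine similarity, used here to compare gradient directions."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (global_norm(a) * global_norm(b))

# Hypothetical gradients on two devices: only the first exceeds clip_norm.
device_grads = [[10.0, 0.0], [0.0, 1.0]]
clip_norm = 1.0

# Reference direction: the plain average of the unclipped gradients.
true_mean = mean_grads(device_grads)

# Ordering 1 (current): clip on each device, then average.
clip_then_reduce = mean_grads(
    [clip_by_global_norm(g, clip_norm) for g in device_grads]
)

# Ordering 2 (proposed): average first, then clip the result once.
reduce_then_clip = clip_by_global_norm(true_mean, clip_norm)

print("clip -> reduce cosine:", cosine(clip_then_reduce, true_mean))
print("reduce -> clip cosine:", cosine(reduce_then_clip, true_mean))
```

In this toy case, clipping after the reduction only rescales the averaged gradient, so its direction is unchanged, while clipping per device shrinks one device's contribution and not the other's, which rotates the average. Whether that rotation is large enough to destabilize a real model depends on how unevenly the per-device norms are distributed.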

Aug 15 '23 04:08 JingyangXiang

Duplicate of #603

Sep 20 '23 02:09 geniuspatrick

We are working on a new trainstep. However, it should be noted that custom train step is an experimental feature and may undergo incompatible changes in the future.

Sep 20 '23 02:09 geniuspatrick