[rexnet_x09] [Ascend910] [GRAPH] Distributed train failed

Open 787918582 opened this issue 2 years ago • 0 comments

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory): Distributed training of rexnet_x09 fails with an error, and the resulting accuracy is also abnormal.

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved: /device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx): mindspore_v2.0.0, mindcv_0.2.2
    -- Python version (e.g., Python 3.7.5): 3.7.5
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04): EulerOS2.8
    -- GCC/Compiler version (if compiled from source): 7.3.0

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved: /mode graph

To Reproduce (Mandatory) Steps to reproduce the behavior:

  1. mpirun --allow-run-as-root -n 8 python train.py --config configs/rexnet/rexnet_x09_ascend.yaml --distribute True --data_dir /ImageNet_Origin/
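Since the error is only attached as screenshots, a text log of the same run may be easier to search and quote. The following is a minimal sketch, not part of mindcv: it rebuilds the command from step 1 and saves the combined stdout/stderr to a file. The helper names `build_command` and `run_and_log` and the log file name `train_rexnet_x09.log` are hypothetical; the arguments are taken verbatim from step 1.

```python
import shlex
import subprocess
import sys


def build_command(num_devices=8,
                  config="configs/rexnet/rexnet_x09_ascend.yaml",
                  data_dir="/ImageNet_Origin/"):
    """Return the reproduction command from step 1 as an argv list."""
    return [
        "mpirun", "--allow-run-as-root", "-n", str(num_devices),
        "python", "train.py",
        "--config", config,
        "--distribute", "True",
        "--data_dir", data_dir,
    ]


def run_and_log(cmd, log_path="train_rexnet_x09.log"):  # hypothetical file name
    """Run cmd, write combined stdout/stderr to log_path, return the exit code."""
    with open(log_path, "w") as log:
        # Record the exact command first so the log is self-describing.
        log.write("$ " + " ".join(shlex.quote(arg) for arg in cmd) + "\n")
        log.flush()
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


if __name__ == "__main__":
    # Must be started from the mindcv repository root, like the original command.
    sys.exit(run_and_log(build_command()))
```

The output of all 8 processes ends up interleaved in the one file, so the first traceback in the log is usually the place to start reading.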

Expected behavior (Mandatory): The error is fixed and the target accuracy is reproduced.

Screenshots / Logs (Mandatory): [image] [image]

Additional context (Optional): Add any other context about the problem here.

787918582 · Jul 31 '23 02:07