[shufflenetv1] [Ascend910] [GRAPH] Distributed train failed

Open 787918582 opened this issue 2 years ago • 2 comments

If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md

Describe the bug (Mandatory): Distributed training fails with an error for shufflenet_v1_0_5 and shufflenet_v1_1_0.

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved: /device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx): mindspore_v2.0.0, mindcv_0.2.2
    -- Python version (e.g., Python 3.7.5): 3.7.5
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04): EulerOS 2.8
    -- GCC/Compiler version (if compiled from source): 7.3.0

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved: /mode graph

To Reproduce (Mandatory): Steps to reproduce the behavior:

  1. mpirun --allow-run-as-root -n 8 python train.py --config configs/shufflenetv1/shufflenet_v1_1.0_ascend.yaml --distribute True --data_dir /ImageNet_Origin/

Expected behavior (Mandatory): The full distributed training run completes without errors.

Screenshots / Logs (Mandatory): shufflenetv1

Additional context (Optional): Add any other context about the problem here. The error also reproduces on v2.1.0, v2.2.0, and v2.2.1.

787918582, Jul 04 '23 10:07

The error also reproduces on MindSpore 2.2.10.B180.

tacyi, Jan 10 '24 03:01

Training also fails on MindSpore_v2.2.10.B180: RuntimeError: Found inconsistent format or data type! Op: Mul[@kernel_graph_2:207{[0]: ValueNode<Primitive> Mul, [1]: equiv_207, [2]: ValueNode<Tensor> Tensor(shape=[], dtype=Float32, value=0.04096)}],ame: Default/network-TrainOneStepCell/optimizer-Momentum/Mul-op1711

tacyi, Jan 22 '24 02:01
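
For triage, the fields of the RuntimeError quoted above can be pulled apart with a small script. This is only a sketch: `summarize` is a hypothetical helper, and the regular expressions are assumptions fitted to this one message, not a documented MindSpore log format. The truncated `ame:` label is kept exactly as posted.

```python
import re

# Error text copied from the comment above (the truncated "ame:" label is kept as-is).
ERROR = (
    "RuntimeError: Found inconsistent format or data type! Op: "
    "Mul[@kernel_graph_2:207{[0]: ValueNode<Primitive> Mul, [1]: equiv_207, "
    "[2]: ValueNode<Tensor> Tensor(shape=[], dtype=Float32, value=0.04096)}],"
    "ame: Default/network-TrainOneStepCell/optimizer-Momentum/Mul-op1711"
)


def summarize(message: str) -> dict:
    """Hypothetical helper: extract the op, the scalar constant, and the scope path."""
    op = re.search(r"Op: (\w+)\[", message)                         # failing operator
    const = re.search(r"dtype=(\w+), value=([-+0-9.eE]+)", message)  # scalar input
    scope = re.search(r"(Default/\S+)", message)                     # node scope path
    return {
        "op": op.group(1) if op else None,
        "const_dtype": const.group(1) if const else None,
        "const_value": float(const.group(2)) if const else None,
        "scope": scope.group(1).split("/") if scope else None,
    }


info = summarize(ERROR)
for key, value in info.items():
    print(f"{key}: {value}")
```

Read this way, the message points at a `Mul` node under `optimizer-Momentum` inside `TrainOneStepCell`, with a scalar Float32 constant as one input. What that constant represents in the training config is not stated in the thread.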