[hrnet] [Ascend910] [GRAPH] Distributed training failed
If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md
Describe the bug (Mandatory)
Both hrnet_w32 and hrnet_w48 fail with an error when running distributed training in static graph (GRAPH) mode.
Hardware Environment (Ascend/GPU/CPU):
- /device ascend
Software Environment (Mandatory):
- MindSpore version (e.g., 1.7.0.Bxxx): mindspore_v2.2.1, mindcv_0.2.2
- Python version (e.g., Python 3.7.5): 3.7.5
- OS platform and distribution (e.g., Linux Ubuntu 16.04): EulerOS2.8
- GCC/Compiler version (if compiled from source): 7.3.0
Execute Mode (Mandatory) (PyNative/Graph):
- /mode graph
To Reproduce (Mandatory)
Steps to reproduce the behavior:
- mpirun --allow-run-as-root -n 8 python train.py --config configs/hrnet/hrnet_w32_ascend.yaml --distribute True --data_dir /ImageNet_Origin/
Expected behavior (Mandatory)
Distributed training in static graph mode runs to completion.
Screenshots / Logs (Mandatory)
[2023-11-19 10:29:13] mindcv.scheduler.scheduler_factory WARNING - warmup_epochs + decay_epochs > num_epochs. Please check and reduce decay_epochs!
[2023-11-19 10:29:16] mindcv.train INFO - Essential Experiment Configurations:
MindSpore mode[GRAPH(0)/PYNATIVE(1)]: 0
Distributed mode: True
Number of devices: 8
Number of training samples: 800000
Number of validation samples: None
Number of classes: 1000
Number of batches: 781
Batch size: 128
Auto augment: randaug-m7-mstd0.5
MixUp: 0.2
CutMix: 1.0
Model: hrnet_w32
Model parameters: 41303464
Number of epochs: 5
Optimizer: adamw
Learning rate: 0.001
LR Scheduler: cosine_decay
Momentum: 0.9
Weight decay: 0.05
Auto mixed precision: O2
Loss scale: 1024(fixed)
[2023-11-19 10:29:16] mindcv.train INFO - Start training
[ERROR] PIPELINE(171895,ffff914f2190,python):2023-11-19-10:29:53.881.102 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171893,ffffbe9fb190,python):2023-11-19-10:29:54.378.528 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171887,ffff9b3ad190,python):2023-11-19-10:29:54.825.669 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171889,ffff87cee190,python):2023-11-19-10:29:55.189.347 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171890,ffff91938190,python):2023-11-19-10:29:55.439.711 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171894,ffff929f0190,python):2023-11-19-10:29:55.738.301 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171888,ffff8a2c7190,python):2023-11-19-10:29:56.666.323 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[ERROR] PIPELINE(171891,ffffb5509190,python):2023-11-19-10:29:57.019.842 [mindspore/ccsrc/pipeline/jit/ps/fallback.cc:464] GeneratePyExecuteNodeWithScriptSrc] Not found PyExecute input. script: x[i] = self.branchesi
[WARNING] MD(171895,fffc8ffff1e0,python):2023-11-19-10:30:19.682.318 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:1168] DetectPerBatchTime] Bad performance attention, it takes more than 25 seconds to fetch a batch of data from dataset pipeline, which might result GetNext timeout problem. You may test dataset processing performance(with creating dataset iterator) and optimize it.
Traceback (most recent call last):
File "/data3/zl/jenkins/workspace/Kits/source_code/mindcv//train.py", line 323, in
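As a quick sanity check on the configuration summary in the log above, the following sketch recomputes the reported number of batches and expresses the condition behind the scheduler warning. It assumes that each of the 8 devices processes its own batch of 128 and that incomplete batches are dropped; the function names and the warmup/decay values are hypothetical and are not taken from mindcv.

```python
def batches_per_epoch(num_samples: int, per_device_batch: int, num_devices: int) -> int:
    """Number of full global batches per epoch, assuming incomplete batches are dropped."""
    return num_samples // (per_device_batch * num_devices)


def schedule_exceeds_epochs(warmup_epochs: int, decay_epochs: int, num_epochs: int) -> bool:
    """Condition named in the scheduler warning: warmup_epochs + decay_epochs > num_epochs."""
    return warmup_epochs + decay_epochs > num_epochs


# Values from the log: 800000 samples, batch size 128, 8 devices.
print(batches_per_epoch(800000, 128, 8))  # 781, matching "Number of batches: 781"

# Hypothetical warmup/decay values; only num_epochs=5 comes from the log.
print(schedule_exceeds_epochs(3, 5, 5))
```

The batch count agrees with the log, so the data sharding looks as expected; the scheduler warning is most likely just a consequence of running only 5 epochs with a config written for a longer schedule.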
Additional context (Optional)
The error is reproduced on ms2.2.10.B180.
Full training succeeded on MindSpore_v2.2.10.B180.
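The script fragment in the PIPELINE errors (`x[i] = self.branches...`) appears to be an item assignment into the list `x` inside a loop over branches. The sketch below uses plain Python stand-ins, not MindSpore cells, to contrast that pattern with a variant that builds a new list. The function names are hypothetical, and whether the second form avoids the graph-mode error is an untested assumption.

```python
from typing import Callable, List


def apply_branches_inplace(branches: List[Callable], x: List) -> List:
    # Pattern suggested by the log: overwrite each element of the input list.
    for i in range(len(branches)):
        x[i] = branches[i](x[i])
    return x


def apply_branches_new_list(branches: List[Callable], x: List) -> List:
    # Alternative: collect results in a fresh list and leave the input untouched.
    out = []
    for i in range(len(branches)):
        out.append(branches[i](x[i]))
    return out


# Stand-in "branches"; in the real model these would be network sub-modules.
branches = [lambda v: v + 1, lambda v: v * 2]
inputs = [1, 2]
print(apply_branches_new_list(branches, inputs))  # [2, 4]
print(inputs)                                     # [1, 2], unchanged
```

Both functions return the same values; the only difference is whether the input list is mutated, which is the part the graph compiler seems to be complaining about.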