
[yolov3] [Ascend910a] [GRAPH] Distributed train failed

Open 787918582 opened this issue 2 years ago • 2 comments

Environment

Hardware Environment(Ascend/GPU/CPU):

Uncomment only one /device <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/device ascend

Software Environment:

  • MindSpore version (source or binary): mindspore_v2.1.0
  • Python version (e.g., Python 3.7.5): Python 3.7.5
  • OS platform and distribution (e.g., Linux Ubuntu 16.04): EulerOS 2.8
  • GCC/Compiler version (if compiled from source): 7.3.0

Describe the current behavior

With run_eval=True set, full distributed training fails with an error for yolov3, yolov4, yolov5, yolov7, yolov8, and other models.

Describe the expected behavior

With run_eval=True set, distributed training should complete normally.

Steps to reproduce the issue

  1. mpirun --allow-run-as-root -n 8 python /data3/zl/tmp/source_code/mindyolo//train.py --config /data3/zl/tmp/source_code/mindyolo/configs/yolov4/yolov4.yaml --is_parallel True --ms_mode 0 --device_target Ascend --keep_checkpoint_max 300 --run_eval True --weight /nfs_for_sync/Mindlab_data/dataset/preckpt/yolov4/yolov4_backbone.ckpt

Related log / screenshot

2023-11-03 17:50:49,472 [INFO] Epoch 2/100, Step 600/924, imgsize (640, 640), loss: 5.2299, lbox: 1.3107, lcls: 2.3511, dfl: 1.5682, cur_lr: 0.05046505108475685
2023-11-03 17:50:49,478 [INFO] Epoch 2/100, Step 600/924, step time: 305.22 ms
[m2301007135:374576] *** Process received signal ***
[m2301007135:374576] Signal: Segmentation fault (11)
[m2301007135:374576] Signal code: Address not mapped (1)
[m2301007135:374576] Failing at address: 0x55d64a257900
[m2301007135:374576] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7fcd5a700420]
[m2301007135:374576] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18b8f5)[0x7fcd5a6858f5]
[m2301007135:374576] [ 2] /root/miniconda3/envs/Python380/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0x130e22)[0x7fcd5963ce22]
[m2301007135:374576] [ 3] /root/miniconda3/envs/Python380/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0x9d29c)[0x7fcd595a929c]
[m2301007135:374576] [ 4] /root/miniconda3/envs/Python380/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0x9dd7c)[0x7fcd595a9d7c]
[m2301007135:374576] [ 5] /root/miniconda3/envs/Python380/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0xaa67f)[0x7fcd595b667f]
[m2301007135:374576] [ 6] /root/miniconda3/envs/Python380/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so(+0x14ee2c)[0x7fcd5965ae2c]
[m2301007135:374576] [ 7] python(PyCFunction_Call+0xdb)[0x55d6362236bb]
[m2301007135:374576] [ 8] python(_PyObject_MakeTpCall+0x22f)[0x55d6361e004f]
[m2301007135:374576] [ 9] python(_PyEval_EvalFrameDefault+0x485)[0x55d636267395]
[m2301007135:374576] [10] python(_PyEval_EvalCodeWithName+0x2d2)[0x55d63622e7c2]
[m2301007135:374576] [11] python(_PyFunction_Vectorcall+0x1e3)[0x55d63622f7c3]
[m2301007135:374576] [12] python(+0xfba8d)[0x55d6361a8a8d]
[m2301007135:374576] [13] python(_PyEval_EvalCodeWithName+0x2d2)[0x55d63622e7c2]
[m2301007135:374576] [14] python(_PyFunction_Vectorcall+0x1e3)[0x55d63622f7c3]
[m2301007135:374576] [15] python(+0xfba8d)[0x55d6361a8a8d]
[m2301007135:374576] [16] python(_PyEval_EvalCodeWithName+0x2d2)[0x55d63622e7c2]
[m2301007135:374576] [17] python(_PyFunction_Vectorcall+0x1e3)[0x55d63622f7c3]
[m2301007135:374576] [18] python(+0xfba8d)[0x55d6361a8a8d]
[m2301007135:374576] [19] python(_PyEval_EvalCodeWithName+0x2d2)[0x55d63622e7c2]
[m2301007135:374576] [20] python(_PyFunction_Vectorcall+0x1e3)[0x55d63622f7c3]
[m2301007135:374576] [21] python(+0xfba8d)[0x55d6361a8a8d]
[m2301007135:374576] [22] python(_PyEval_EvalCodeWithName+0x2d2)[0x55d63622e7c2]
[m2301007135:374576] [23] python(_PyFunction_Vectorcall+0x1e3)[0x55d63622f7c3]
[m2301007135:374576] [24] python(+0xfba8d)[0x55d6361a8a8d]
[m2301007135:374576] [25] python(_PyFunction_Vectorcall+0x10b)[0x55d63622f6eb]
[m2301007135:374576] [26] python(+0xfaf50)[0x55d6361a7f50]
[m2301007135:374576] [27] python(_PyFunction_Vectorcall+0x10b)[0x55d63622f6eb]
[m2301007135:374576] [28] python(+0xb4ca6)[0x55d636161ca6]
[m2301007135:374576] [29] python(+0x1752b2)[0x55d6362222b2]
[m2301007135:374576] *** End of error message ***
[CRITICAL] ME(365854:140510318556928,MainProcess):2023-11-03-17:51:33.361.262 [mindspore/dataset/engine/datasets.py:3312] The subprocess of dataset may exit unexpected or be killed, main process will exit. If this is not an artificial operation, you can use ds.config.set_enable_watchdog(False) to block this error.

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

[m2301007135:374291] *** Process received signal ***
[m2301007135:374291] Signal: Segmentation fault (11)
[m2301007135:374291] Signal code: Address not mapped (1)
[m2301007135:374291] Failing at address: 0x55de0d5a9008
[m2301007135:374291] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f823a29b420]
[m2301007135:374291] [ 1] /lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x33)[0x7f823a2cca53]
[m2301007135:374291] [ 2] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7242)[0x7f823a28e242]
[m2301007135:374291] [ 3] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8522)[0x7f823a28f522]
[m2301007135:374291] [ 4] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8882)[0x7f823a28f882]
[m2301007135:374291] [ 5] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f823a1b4133]
[m2301007135:374291] *** End of error message ***
[m2301007135:374286] *** Process received signal ***
[m2301007135:374286] Signal: Segmentation fault (11)
[m2301007135:374286] Signal code: Address not mapped (1)
[m2301007135:374286] Failing at address: 0x55de0d5a9008
[m2301007135:374286] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f823a29b420]
[m2301007135:374286] [ 1] /lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x33)[0x7f823a2cca53]
[m2301007135:374286] [ 2] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7242)[0x7f823a28e242]
[m2301007135:374286] [ 3] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8522)[0x7f823a28f522]
[m2301007135:374286] [ 4] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8882)[0x7f823a28f882]
[m2301007135:374286] [ 5] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f823a1b4133]
[m2301007135:374286] *** End of error message ***
[m2301007135:374284] *** Process received signal ***
[m2301007135:374284] Signal: Segmentation fault (11)
[m2301007135:374284] Signal code: Address not mapped (1)
[m2301007135:374284] Failing at address: 0x55de0d5a9008
[m2301007135:374284] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f823a29b420]
[m2301007135:374284] [ 1] /lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x33)[0x7f823a2cca53]
[m2301007135:374284] [ 2] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7242)[0x7f823a28e242]
[m2301007135:374284] [ 3] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8522)[0x7f823a28f522]
[m2301007135:374284] [ 4] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8882)[0x7f823a28f882]
[m2301007135:374284] [ 5] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f823a1b4133]
[m2301007135:374284] *** End of error message ***
[m2301007135:374310] *** Process received signal ***
[m2301007135:374310] Signal: Segmentation fault (11)
[m2301007135:374310] Signal code: Address not mapped (1)
[m2301007135:374310] Failing at address: 0x55de0d5a9008
[m2301007135:374310] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f823a29b420]
[m2301007135:374310] [ 1] /lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x33)[0x7f823a2cca53]
[m2301007135:374310] [ 2] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7242)[0x7f823a28e242]
[m2301007135:374310] [ 3] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8522)[0x7f823a28f522]
[m2301007135:374310] [ 4] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8882)[0x7f823a28f882]
[m2301007135:374310] [ 5] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f823a1b4133]
[m2301007135:374310] *** End of error message ***
[m2301007135:374315] *** Process received signal ***
[m2301007135:374315] Signal: Segmentation fault (11)
[m2301007135:374315] Signal code: Address not mapped (1)
[m2301007135:374315] Failing at address: 0x55de0d5a9008
[m2301007135:374315] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f823a29b420]
[m2301007135:374315] [ 1] /lib64/ld-linux-x86-64.so.2(_dl_deallocate_tls+0x33)[0x7f823a2cca53]
[m2301007135:374315] [ 2] /lib/x86_64-linux-gnu/libpthread.so.0(+0x7242)[0x7f823a28e242]
[m2301007135:374315] [ 3] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8522)[0x7f823a28f522]
[m2301007135:374315] [ 4] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8882)[0x7f823a28f882]
[m2301007135:374315] [ 5] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f823a1b4133]
[m2301007135:374315] *** End of error message ***
[m2301007135:374299] *** Process received signal ***
[m230100[WARNING] ME(374725:140026391783232,WriterPool-31):2023-11-03-17:51:33.637.685 [mindspore/train/summary/_writer_pool.py:193] The training process 365854 has exited, summary process will exit.
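
For triage, the interleaved mpirun output above can be grouped per process. The following is only a rough sketch, not part of mindyolo or MindSpore: it assumes every crash line carries a `[host:pid]` prefix as in the log above, and the helper names `split_by_pid` and `summarize` are hypothetical.

```python
import re
from collections import defaultdict

# Matches prefixes such as "[m2301007135:374576]" (host:pid), as seen in the log.
PREFIX = re.compile(r"\[(?P<host>[\w.-]+):(?P<pid>\d+)\]\s*")
# Matches a backtrace frame body such as "[ 0] /lib/.../libpthread.so.0(+0x14420)[0x...]".
FRAME = re.compile(r"^\[\s*(\d+)\]\s+(\S+)")


def split_by_pid(log_text):
    """Group the text that follows each [host:pid] prefix by pid (hypothetical helper)."""
    entries = defaultdict(list)
    matches = list(PREFIX.finditer(log_text))
    for cur, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt is not None else len(log_text)
        # Collapse whitespace so entries wrapped across lines are rejoined.
        body = " ".join(log_text[cur.end():end].split())
        if body:
            entries[cur.group("pid")].append(body)
    return dict(entries)


def summarize(log_text, max_frames=3):
    """Return the signal line and the first few backtrace frames for each pid."""
    summary = {}
    for pid, bodies in split_by_pid(log_text).items():
        signal = next((b for b in bodies if b.startswith("Signal:")), None)
        frames = []
        for body in bodies:
            match = FRAME.match(body)
            if match:
                frames.append(match.group(2))
        summary[pid] = {"signal": signal, "top_frames": frames[:max_frames]}
    return summary
```

Calling `summarize(open("train.log").read())` on a saved log (the file name is an assumption) gives one entry per crashed process. That makes it easier to see that only process 374576 died inside numpy's `_multiarray_umath`, while the other processes shown crashed in `_dl_deallocate_tls` during thread teardown.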

Special notes for this issue

787918582, Nov 15 '23 02:11

We'll look into the problem with the run_eval feature.

zhanghuiyao, Nov 15 '23 02:11

This has been fixed in a PR.

Mark-ZhouWX, Nov 24 '23 02:11