PaddleOCR: Segmentation Fault when training the SVTR Chinese model

Please provide the following information to quickly locate the problem

System Environment: tried on Ubuntu 18.04, Ubuntu 20.04, and CentOS 7
Version: Paddle 2.3.2; PaddleOCR release/2.6
Related components: tools/program.py; ppocr/optimizer/optimizer.py
Command Code: python3 -m paddle.distributed.launch --log_dir=./debug/ --gpus '0' tools/train.py -c configs/rec/rec_svtr_tiny_6local_6global_stn_ch.yml
Here "rec_svtr_tiny_6local_6global_stn_ch.yaml" is the configuration file for the Chinese model provided in https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/doc/doc_ch/algorithm_rec_svtr.md (the SVTR algorithm documentation).
Complete Error Message:


[2022/11/03 12:04:05] ppocr INFO: Initialize indexs of datasets:/mnt/nas101/datasets/private/CV/OCR/LMDB_datasets/text_recognition/chinese_benchmark/scene/test
W1103 12:04:06.059440 29920 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 10.2
W1103 12:04:06.062000 29920 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022/11/03 12:04:06] ppocr INFO: train dataloader has 3978 iters
[2022/11/03 12:04:06] ppocr INFO: valid dataloader has 249 iters
[2022/11/03 12:04:06] ppocr INFO: train from scratch
[2022/11/03 12:04:06] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 400 iterations


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
No stack trace in paddle, may be caused by external reasons.

----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1667448247 (unix time) try "date -d @1667448247" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x0) received by PID 29920 (TID 0x7f3f83bbc740) from PID 0 ***]

INFO 2022-11-03 12:04:13,547 launch_utils.py:343] terminate all the procs
INFO 2022-11-03 12:04:13,547 launch_utils.py:343] terminate all the procs
ERROR 2022-11-03 12:04:13,548 launch_utils.py:640] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
ERROR 2022-11-03 12:04:13,548 launch_utils.py:640] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-11-03 12:04:17,552 launch_utils.py:343] terminate all the procs
INFO 2022-11-03 12:04:17,552 launch_utils.py:343] terminate all the procs
INFO 2022-11-03 12:04:17,553 launch.py:402] Local processes completed.
INFO 2022-11-03 12:04:17,553 launch.py:402] Local processes completed.

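Single-step debugging shows that the error occurs at the "optimizer.step()" line in tools/program.py. Stepping into that call shows that the crash happens in the C_ops.adamw operator invoked by the AdamW class.

Other information: the error depends to some extent on batch size. Smaller batch sizes are less likely to trigger the segmentation fault. The number of steps and the batch size needed to trigger it differ across platforms (operating systems), with no obvious pattern.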
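Because the C++ traceback above is empty ("No stack trace in paddle"), one way to confirm which Python line is active when the SIGSEGV arrives is the standard-library faulthandler module. The following is a minimal sketch, not part of PaddleOCR: the helper name enable_crash_traceback is hypothetical, and calling it near the top of tools/train.py is only an assumption about where it would be placed.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=sys.stderr):
    """Ask CPython to dump the Python stack of every thread on a fatal signal.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
    SIGILL. `stream` must be a real file object with a file descriptor, and it
    must stay open for as long as the handler is enabled.
    """
    if not faulthandler.is_enabled():
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    # Hypothetical placement: call this once, before training starts.
    enable_crash_traceback()
```

The same effect is available without editing any code by starting the interpreter with python3 -X faulthandler tools/train.py -c configs/rec/rec_svtr_tiny_6local_6global_stn_ch.yml, or by setting the PYTHONFAULTHANDLER environment variable. This only reports the Python-level frames; it does not explain why the native operator crashes.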

Nov 03 '22 04:11 zzyhlyoko

For single-GPU training, try this command: python3 tools/train.py -c configs/rec/rec_svtr_tiny_6local_6global_stn_ch.yml

Nov 04 '22 08:11 andyjiang1116

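I ran into the same problem. Has it been solved? A small batch size works, but a large one triggers the error, even though less than half of the GPU memory is in use.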

Nov 21 '22 05:11 c-cn

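We could not reproduce this problem. We suggest trying the latest version of paddlepaddle, or a different CUDA environment; this may be an environment issue. If you are training on a single GPU, be sure to use the command suggested by @andyjpaddle:

For single-GPU training, try this command: python3 tools/train.py -c configs/rec/rec_svtr_tiny_6local_6global_stn_ch.yml

If you need to specify the GPU ID, we suggest the following command: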

CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/rec/rec_svtr_tiny_6local_6global_stn_ch.yml
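For anyone who wants to script the single-GPU launch above, for example to retry it with several batch sizes, here is a rough Python sketch. The helper name build_single_gpu_run is hypothetical. The optional "-o Train.loader.batch_size_per_card=..." override assumes that the config keeps its batch size under Train.loader and that tools/train.py accepts -o overrides, so check both against your checkout before relying on it.

```python
import os
import subprocess
import sys

CONFIG = "configs/rec/rec_svtr_tiny_6local_6global_stn_ch.yml"


def build_single_gpu_run(config=CONFIG, gpu_id=0, batch_size=None):
    """Return (cmd, env) for a single-process run pinned to one GPU."""
    env = dict(os.environ)
    # Same effect as prefixing the shell command with CUDA_VISIBLE_DEVICES=<id>.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    cmd = [sys.executable, "tools/train.py", "-c", config]
    if batch_size is not None:
        # Assumed override key; verify it against your config file.
        cmd += ["-o", f"Train.loader.batch_size_per_card={batch_size}"]
    return cmd, env


if __name__ == "__main__":
    # Run from the PaddleOCR repository root.
    cmd, env = build_single_gpu_run(gpu_id=0)
    result = subprocess.run(cmd, env=env)
    # On POSIX, a negative return code -N means the child was killed by signal N.
    print("return code:", result.returncode)
```

Calling build_single_gpu_run(batch_size=...) with a few different values and recording the return code of each run is one way to narrow down the batch size at which the crash starts on a given machine.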

Nov 21 '22 06:11 Topdu
