PaddleOCR
Training the SVTR Chinese model fails with Segmentation Fault
Please provide the following information to quickly locate the problem
- System Environment: tried on Ubuntu 18.04, Ubuntu 20.04, and CentOS 7
- Version: Paddle 2.3.2, PaddleOCR release/2.6
- Related components: tools/program.py; ppocr/optimizer/optimizer.py
- Command Code: python3 -m paddle.distributed.launch --log_dir=./debug/ --gpus '0' tools/train.py -c configs/rec/rec_svtr_tiny_6local_6global_stn_ch.yml Here “rec_svtr_tiny_6local_6global_stn_ch.yml” is the Chinese model configuration file provided in https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/doc/doc_ch/algorithm_rec_svtr.md
- Complete Error Message:
[2022/11/03 12:04:05] ppocr INFO: Initialize indexs of datasets:/mnt/nas101/datasets/private/CV/OCR/LMDB_datasets/text_recognition/chinese_benchmark/scene/test
W1103 12:04:06.059440 29920 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 10.2
W1103 12:04:06.062000 29920 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022/11/03 12:04:06] ppocr INFO: train dataloader has 3978 iters
[2022/11/03 12:04:06] ppocr INFO: valid dataloader has 249 iters
[2022/11/03 12:04:06] ppocr INFO: train from scratch
[2022/11/03 12:04:06] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 400 iterations
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
No stack trace in paddle, may be caused by external reasons.
----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
[TimeInfo: *** Aborted at 1667448247 (unix time) try "date -d @1667448247" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0x0) received by PID 29920 (TID 0x7f3f83bbc740) from PID 0 ***]
INFO 2022-11-03 12:04:13,547 launch_utils.py:343] terminate all the procs
INFO 2022-11-03 12:04:13,547 launch_utils.py:343] terminate all the procs
ERROR 2022-11-03 12:04:13,548 launch_utils.py:640] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
ERROR 2022-11-03 12:04:13,548 launch_utils.py:640] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-11-03 12:04:17,552 launch_utils.py:343] terminate all the procs
INFO 2022-11-03 12:04:17,552 launch_utils.py:343] terminate all the procs
INFO 2022-11-03 12:04:17,553 launch.py:402] Local processes completed.
INFO 2022-11-03 12:04:17,553 launch.py:402] Local processes completed.
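Single-step debugging shows that the error occurs at the “optimizer.step()” line in tools/program.py. Stepping into that call shows that the crash happens in the C_ops.adamw operator invoked by the AdamW class. Other information: the error depends to some extent on batch size. Smaller batch sizes are less likely to trigger the segmentation fault. The number of steps and the batch size needed to trigger it differ across platforms (systems), with no obvious pattern.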
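Since the log above says "No stack trace in paddle", one way to confirm which Python line was running when the SIGSEGV arrived is Python's standard faulthandler module. This is a minimal sketch, assuming it is called once near the start of the training script; the helper name enable_crash_traceback and the log path are hypothetical and not part of PaddleOCR.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Dump the Python traceback of all threads to log_path on a fatal signal.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
    SIGILL. The returned file object must stay open for as long as the
    handler is enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical usage at the top of a training script:
# crash_log = enable_crash_traceback("./debug/crash_traceback.log")
```

The same handler can be enabled without code changes by starting the interpreter with python3 -X faulthandler, in which case the traceback goes to stderr.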
For single-GPU training, try this command: python3 tools/train.py -c configs/rec/rec_svtr_tiny_6local_6global_stn_ch.yml
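I am hitting the same problem. Has it been solved? A small batch size works, but a large one raises the error, even though less than half of the GPU memory is in use.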
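To narrow down the batch-size dependence described above, the training script could be run once per batch size while recording which runs die from SIGSEGV. The sketch below rests on several assumptions: that the -o option of tools/train.py overrides config keys, that the batch size is stored under Train.loader.batch_size_per_card, and that the process actually exits with the SIGSEGV status (on POSIX, subprocess reports that as a negative return code). The function names are hypothetical.

```python
import signal
import subprocess
import sys


def segfaulted(returncode):
    """True if a child process was terminated by SIGSEGV (POSIX convention)."""
    return returncode == -signal.SIGSEGV


def train_command(config_path, batch_size):
    # Assumption: "-o key=value" overrides a config entry, and the batch
    # size lives under Train.loader.batch_size_per_card in this config.
    return [
        sys.executable, "tools/train.py", "-c", config_path,
        "-o", f"Train.loader.batch_size_per_card={batch_size}",
    ]


def sweep(config_path, batch_sizes, run=subprocess.run):
    """Run training once per batch size; map batch size -> segfault seen."""
    results = {}
    for batch_size in batch_sizes:
        proc = run(train_command(config_path, batch_size))
        results[batch_size] = segfaulted(proc.returncode)
    return results
```

Each run continues until training finishes or crashes, so in practice the run would need to be shortened. Because the crash reportedly shows no clear pattern, a run that survives does not prove that batch size is safe.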
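I could not reproduce this problem. I suggest trying the latest version of paddlepaddle, or a different CUDA environment, since this may be an environment issue. If you are training on a single GPU, be sure to use the command suggested by @andyjpaddle:
For single-GPU training, try this command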
python3 tools/train.py -c configs/rec/rec_svtr_tiny_6local_6global_stn_ch.yml
If you need to specify the GPU ID, the following command is recommended:
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/rec/rec_svtr_tiny_6local_6global_stn_ch.yml
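The two commands above differ only in whether CUDA_VISIBLE_DEVICES is set for the training process. As a sketch of the same idea from Python, the helper below builds the argument list and environment for a single-process run without paddle.distributed.launch. The name single_gpu_launch and its parameters are hypothetical; only the command line and the environment variable come from the thread.

```python
import os
import sys


def single_gpu_launch(config_path, gpu_id=None, base_env=None):
    """Return (argv, env) for a single-process training run.

    gpu_id=None leaves GPU visibility unchanged. An integer restricts the
    process to that one physical GPU through CUDA_VISIBLE_DEVICES, so the
    framework sees it as device 0.
    """
    env = dict(os.environ if base_env is None else base_env)
    if gpu_id is not None:
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    argv = [sys.executable, "tools/train.py", "-c", config_path]
    return argv, env


# Hypothetical usage from the PaddleOCR repository root:
# import subprocess
# argv, env = single_gpu_launch(
#     "configs/rec/rec_svtr_tiny_6local_6global_stn_ch.yml", gpu_id=0)
# subprocess.run(argv, env=env, check=True)
```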