PaddleOCR
SVTR training error
- System Environment: 4 x A100, Nvidia 470.129.06, CentOS 7, PPOCR 2.6 installed on top of the official Paddle Docker image 2.3.2-gpu-cuda10.2-cudnn7
- Version: Paddle: 2.3.2, PaddleOCR 2.6. Related components: SVTR
- Command Code:
python tools/train.py -c configs/rec/rec_svtrnet.yml
- Complete Error Message:
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
[2022/11/09 11:01:31] ppocr INFO: Architecture :
[2022/11/09 11:01:31] ppocr INFO: Backbone :
[2022/11/09 11:01:31] ppocr INFO: depth : [3, 6, 3]
[2022/11/09 11:01:31] ppocr INFO: embed_dim : [64, 128, 256]
[2022/11/09 11:01:31] ppocr INFO: img_size : [32, 100]
[2022/11/09 11:01:31] ppocr INFO: last_stage : True
[2022/11/09 11:01:31] ppocr INFO: local_mixer : [[7, 11], [7, 11], [7, 11]]
[2022/11/09 11:01:31] ppocr INFO: mixer : ['Local', 'Local', 'Local', 'Local', 'Local', 'Local', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global']
[2022/11/09 11:01:31] ppocr INFO: name : SVTRNet
[2022/11/09 11:01:31] ppocr INFO: num_heads : [2, 4, 8]
[2022/11/09 11:01:31] ppocr INFO: out_channels : 192
[2022/11/09 11:01:31] ppocr INFO: out_char_num : 25
[2022/11/09 11:01:31] ppocr INFO: patch_merging : Conv
[2022/11/09 11:01:31] ppocr INFO: prenorm : False
[2022/11/09 11:01:31] ppocr INFO: Head :
[2022/11/09 11:01:31] ppocr INFO: name : CTCHead
[2022/11/09 11:01:31] ppocr INFO: Neck :
[2022/11/09 11:01:31] ppocr INFO: encoder_type : reshape
[2022/11/09 11:01:31] ppocr INFO: name : SequenceEncoder
[2022/11/09 11:01:31] ppocr INFO: Transform :
[2022/11/09 11:01:31] ppocr INFO: name : STN_ON
[2022/11/09 11:01:31] ppocr INFO: num_control_points : 20
[2022/11/09 11:01:31] ppocr INFO: stn_activation : none
[2022/11/09 11:01:31] ppocr INFO: tps_inputsize : [32, 64]
[2022/11/09 11:01:31] ppocr INFO: tps_margins : [0.05, 0.05]
[2022/11/09 11:01:31] ppocr INFO: tps_outputsize : [32, 100]
[2022/11/09 11:01:31] ppocr INFO: algorithm : SVTR
[2022/11/09 11:01:31] ppocr INFO: model_type : rec
[2022/11/09 11:01:31] ppocr INFO: Eval :
[2022/11/09 11:01:31] ppocr INFO: dataset :
[2022/11/09 11:01:31] ppocr INFO: data_dir : /paddle/data/data_lmdb_release/evaluation/
[2022/11/09 11:01:31] ppocr INFO: name : LMDBDataSet
[2022/11/09 11:01:31] ppocr INFO: transforms :
[2022/11/09 11:01:31] ppocr INFO: DecodeImage :
[2022/11/09 11:01:31] ppocr INFO: channel_first : False
[2022/11/09 11:01:31] ppocr INFO: img_mode : BGR
[2022/11/09 11:01:31] ppocr INFO: CTCLabelEncode : None
[2022/11/09 11:01:31] ppocr INFO: SVTRRecResizeImg :
[2022/11/09 11:01:31] ppocr INFO: image_shape : [3, 64, 256]
[2022/11/09 11:01:31] ppocr INFO: padding : False
[2022/11/09 11:01:31] ppocr INFO: KeepKeys :
[2022/11/09 11:01:31] ppocr INFO: keep_keys : ['image', 'label', 'length']
[2022/11/09 11:01:31] ppocr INFO: loader :
[2022/11/09 11:01:31] ppocr INFO: batch_size_per_card : 256
[2022/11/09 11:01:31] ppocr INFO: drop_last : False
[2022/11/09 11:01:31] ppocr INFO: num_workers : 2
[2022/11/09 11:01:31] ppocr INFO: shuffle : False
[2022/11/09 11:01:31] ppocr INFO: Global :
[2022/11/09 11:01:31] ppocr INFO: cal_metric_during_train : True
[2022/11/09 11:01:31] ppocr INFO: character_dict_path : None
[2022/11/09 11:01:31] ppocr INFO: character_type : en
[2022/11/09 11:01:31] ppocr INFO: checkpoints : None
[2022/11/09 11:01:31] ppocr INFO: distributed : False
[2022/11/09 11:01:31] ppocr INFO: epoch_num : 20
[2022/11/09 11:01:31] ppocr INFO: eval_batch_step : [0, 2000]
[2022/11/09 11:01:31] ppocr INFO: infer_img : doc/imgs_words_en/word_10.png
[2022/11/09 11:01:31] ppocr INFO: infer_mode : False
[2022/11/09 11:01:31] ppocr INFO: log_smooth_window : 20
[2022/11/09 11:01:31] ppocr INFO: max_text_length : 25
[2022/11/09 11:01:31] ppocr INFO: pretrained_model : None
[2022/11/09 11:01:31] ppocr INFO: print_batch_step : 10
[2022/11/09 11:01:31] ppocr INFO: save_epoch_step : 1
[2022/11/09 11:01:31] ppocr INFO: save_inference_dir : None
[2022/11/09 11:01:31] ppocr INFO: save_model_dir : ./output/rec/svtr/
[2022/11/09 11:01:31] ppocr INFO: save_res_path : ./output/rec/predicts_svtr_tiny.txt
[2022/11/09 11:01:31] ppocr INFO: use_gpu : True
[2022/11/09 11:01:31] ppocr INFO: use_space_char : False
[2022/11/09 11:01:31] ppocr INFO: use_visualdl : False
[2022/11/09 11:01:31] ppocr INFO: Loss :
[2022/11/09 11:01:31] ppocr INFO: name : CTCLoss
[2022/11/09 11:01:31] ppocr INFO: Metric :
[2022/11/09 11:01:31] ppocr INFO: main_indicator : acc
[2022/11/09 11:01:31] ppocr INFO: name : RecMetric
[2022/11/09 11:01:31] ppocr INFO: Optimizer :
[2022/11/09 11:01:31] ppocr INFO: beta1 : 0.9
[2022/11/09 11:01:31] ppocr INFO: beta2 : 0.99
[2022/11/09 11:01:31] ppocr INFO: epsilon : 8e-08
[2022/11/09 11:01:31] ppocr INFO: lr :
[2022/11/09 11:01:31] ppocr INFO: learning_rate : 0.0005
[2022/11/09 11:01:31] ppocr INFO: name : Cosine
[2022/11/09 11:01:31] ppocr INFO: warmup_epoch : 2
[2022/11/09 11:01:31] ppocr INFO: name : AdamW
[2022/11/09 11:01:31] ppocr INFO: no_weight_decay_name : norm pos_embed
[2022/11/09 11:01:31] ppocr INFO: one_dim_param_no_weight_decay : True
[2022/11/09 11:01:31] ppocr INFO: weight_decay : 0.05
[2022/11/09 11:01:31] ppocr INFO: PostProcess :
[2022/11/09 11:01:31] ppocr INFO: name : CTCLabelDecode
[2022/11/09 11:01:31] ppocr INFO: Train :
[2022/11/09 11:01:31] ppocr INFO: dataset :
[2022/11/09 11:01:31] ppocr INFO: data_dir : /paddle/data/data_lmdb_release/training/
[2022/11/09 11:01:31] ppocr INFO: name : LMDBDataSet
[2022/11/09 11:01:31] ppocr INFO: transforms :
[2022/11/09 11:01:31] ppocr INFO: DecodeImage :
[2022/11/09 11:01:31] ppocr INFO: channel_first : False
[2022/11/09 11:01:31] ppocr INFO: img_mode : BGR
[2022/11/09 11:01:31] ppocr INFO: CTCLabelEncode : None
[2022/11/09 11:01:31] ppocr INFO: SVTRRecResizeImg :
[2022/11/09 11:01:31] ppocr INFO: image_shape : [3, 64, 256]
[2022/11/09 11:01:31] ppocr INFO: padding : False
[2022/11/09 11:01:31] ppocr INFO: KeepKeys :
[2022/11/09 11:01:31] ppocr INFO: keep_keys : ['image', 'label', 'length']
[2022/11/09 11:01:31] ppocr INFO: loader :
[2022/11/09 11:01:31] ppocr INFO: batch_size_per_card : 512
[2022/11/09 11:01:31] ppocr INFO: drop_last : True
[2022/11/09 11:01:31] ppocr INFO: num_workers : 4
[2022/11/09 11:01:31] ppocr INFO: shuffle : True
[2022/11/09 11:01:31] ppocr INFO: profiler_options : None
[2022/11/09 11:01:31] ppocr INFO: train with paddle 2.3.2 and device Place(gpu:0)
[2022/11/09 11:01:31] ppocr INFO: Initialize indexs of datasets:/paddle/data/data_lmdb_release/training/
[2022/11/09 11:01:43] ppocr WARNING: The character_dict_path is None, model can only recognize number and lower letters
[2022/11/09 11:01:43] ppocr INFO: Initialize indexs of datasets:/paddle/data/data_lmdb_release/evaluation/
[2022/11/09 11:01:43] ppocr WARNING: The character_dict_path is None, model can only recognize number and lower letters
W1109 11:01:43.789620 1229 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.4, Runtime API Version: 10.2
W1109 11:01:43.793349 1229 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
Traceback (most recent call last):
File "tools/train.py", line 208, in <module>
main(config, device, logger, vdl_writer)
File "tools/train.py", line 121, in main
model = build_model(config['Architecture'])
File "/paddle/PaddleOCR/ppocr/modeling/architectures/__init__.py", line 30, in build_model
arch = BaseModel(config)
File "/paddle/PaddleOCR/ppocr/modeling/architectures/base_model.py", line 46, in __init__
self.transform = build_transform(config['Transform'])
File "/paddle/PaddleOCR/ppocr/modeling/transforms/__init__.py", line 30, in build_transform
module_class = eval(module_name)(**config)
File "/paddle/PaddleOCR/ppocr/modeling/transforms/stn.py", line 123, in __init__
margins=tuple(tps_margins))
File "/paddle/PaddleOCR/ppocr/modeling/transforms/tps_spatial_transformer.py", line 106, in __init__
inverse_kernel = paddle.inverse(forward_kernel)
File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/tensor/math.py", line 1696, in inverse
return _C_ops.inverse(x)
RuntimeError: (PreconditionNotMet) For batch [0]: U(1, 1) is zero, singular U. Please check the matrix value and change it to a non-singular matrix
[Hint: Expected info[i] == 0, but received info[i]:1 != 0:0.] (at /paddle/paddle/phi/kernels/funcs/matrix_inverse.cu.cc:125)
[operator < inverse > error]
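For readers unfamiliar with the message "U(1, 1) is zero, singular U": it appears to follow the LAPACK-style convention in which an LU factorization reports the 1-based index of the first pivot that is exactly zero. The sketch below is a plain-Python illustration of that convention only. It is not Paddle's or cuBLAS's implementation, and the function name `lu_info` is hypothetical.

```python
def lu_info(matrix):
    """LU factorization with partial pivoting on a square matrix.

    Returns 0 if every pivot is non-zero, otherwise the 1-based index i
    of the first pivot U(i, i) that is exactly zero (LAPACK-style info).
    Illustrative only; `lu_info` is a hypothetical helper.
    """
    a = [[float(v) for v in row] for row in matrix]
    n = len(a)
    for k in range(n):
        # Partial pivoting: choose the row with the largest |entry| in column k.
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[p][k] == 0.0:
            # Every candidate in this column is zero, so U(k+1, k+1) is zero.
            return k + 1
        a[k], a[p] = a[p], a[k]
        # Eliminate the entries below the pivot.
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= factor * a[k][c]
    return 0


print(lu_info([[0, 1], [1, 0]]))  # invertible after a row swap
print(lu_info([[0, 0], [0, 1]]))  # first column is all zeros
```

Under this convention, and with partial pivoting, a zero at U(1, 1) means the whole first column of the input was zero, which may point to the matrix contents on the device rather than to the TPS configuration itself.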
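I could not reproduce this problem. Did you modify the code?
After setting a breakpoint here,
the values printed by the two methods differ:
I have not modified any code. I suspect this is related to compatibility between Paddle and the driver/GPU/Docker image.
Try checking whether training runs normally without the GPU.
With use_gpu turned off, training runs normally.
Switched to 2.1.3-gpu-cuda11.2-cudnn8 and the problem is solved.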
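This outcome is consistent with the `gpu_resources.cc` warning in the log above, which reports GPU Compute Capability 8.0 (A100) together with Runtime API Version 10.2. Compute capability 8.0 is supported starting with CUDA 11.0, so a CUDA 10.2 image is a likely mismatch. Below is a minimal sketch of a check for that log line. The helper name `check_gpu_log_line` and the small lookup table are assumptions made for illustration and are not part of Paddle or PaddleOCR.

```python
import re

# Assumed minimum CUDA runtime per compute capability (illustrative, not exhaustive).
MIN_CUDA_FOR_CC = {
    (8, 0): (11, 0),  # e.g. A100
    (8, 6): (11, 1),
}

# Matches the fields printed by Paddle's gpu_resources.cc warning in the log above.
PATTERN = re.compile(
    r"GPU Compute Capability: (\d+)\.(\d+), "
    r"Driver API Version: (\d+)\.(\d+), "
    r"Runtime API Version: (\d+)\.(\d+)"
)


def check_gpu_log_line(line):
    """Return a list of problem descriptions, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    nums = [int(g) for g in match.groups()]
    cc, driver, runtime = tuple(nums[0:2]), tuple(nums[2:4]), tuple(nums[4:6])
    problems = []
    if runtime > driver:
        problems.append(
            "runtime %d.%d is newer than driver API %d.%d" % (runtime + driver)
        )
    required = MIN_CUDA_FOR_CC.get(cc)
    if required is not None and runtime < required:
        problems.append(
            "compute capability %d.%d needs CUDA >= %d.%d, found %d.%d"
            % (cc + required + runtime)
        )
    return problems


line = (
    "W1109 11:01:43.789620 1229 gpu_resources.cc:61] Please NOTE: device: 0, "
    "GPU Compute Capability: 8.0, Driver API Version: 11.4, Runtime API Version: 10.2"
)
for problem in check_gpu_log_line(line):
    print(problem)
```

Running a check like this before training can flag an image whose CUDA runtime predates the GPU architecture, which is the situation the image switch resolved here.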