PaddleOCR

SVTR training error

Open willpat1213 opened this issue 2 years ago • 6 comments

  • System Environment: 4 x A100, Nvidia 470.129.06, CentOS 7, PPOCR 2.6 installed on the official Paddle Docker image 2.3.2-gpu-cuda10.2-cudnn7
  • Version: Paddle 2.3.2, PaddleOCR 2.6. Related components: SVTR
  • Command:
python tools/train.py -c configs/rec/rec_svtrnet.yml
  • Complete Error Message:
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
[2022/11/09 11:01:31] ppocr INFO: Architecture :
[2022/11/09 11:01:31] ppocr INFO:     Backbone :
[2022/11/09 11:01:31] ppocr INFO:         depth : [3, 6, 3]
[2022/11/09 11:01:31] ppocr INFO:         embed_dim : [64, 128, 256]
[2022/11/09 11:01:31] ppocr INFO:         img_size : [32, 100]
[2022/11/09 11:01:31] ppocr INFO:         last_stage : True
[2022/11/09 11:01:31] ppocr INFO:         local_mixer : [[7, 11], [7, 11], [7, 11]]
[2022/11/09 11:01:31] ppocr INFO:         mixer : ['Local', 'Local', 'Local', 'Local', 'Local', 'Local', 'Global', 'Global', 'Global', 'Global', 'Global', 'Global']
[2022/11/09 11:01:31] ppocr INFO:         name : SVTRNet
[2022/11/09 11:01:31] ppocr INFO:         num_heads : [2, 4, 8]
[2022/11/09 11:01:31] ppocr INFO:         out_channels : 192
[2022/11/09 11:01:31] ppocr INFO:         out_char_num : 25
[2022/11/09 11:01:31] ppocr INFO:         patch_merging : Conv
[2022/11/09 11:01:31] ppocr INFO:         prenorm : False
[2022/11/09 11:01:31] ppocr INFO:     Head :
[2022/11/09 11:01:31] ppocr INFO:         name : CTCHead
[2022/11/09 11:01:31] ppocr INFO:     Neck :
[2022/11/09 11:01:31] ppocr INFO:         encoder_type : reshape
[2022/11/09 11:01:31] ppocr INFO:         name : SequenceEncoder
[2022/11/09 11:01:31] ppocr INFO:     Transform :
[2022/11/09 11:01:31] ppocr INFO:         name : STN_ON
[2022/11/09 11:01:31] ppocr INFO:         num_control_points : 20
[2022/11/09 11:01:31] ppocr INFO:         stn_activation : none
[2022/11/09 11:01:31] ppocr INFO:         tps_inputsize : [32, 64]
[2022/11/09 11:01:31] ppocr INFO:         tps_margins : [0.05, 0.05]
[2022/11/09 11:01:31] ppocr INFO:         tps_outputsize : [32, 100]
[2022/11/09 11:01:31] ppocr INFO:     algorithm : SVTR
[2022/11/09 11:01:31] ppocr INFO:     model_type : rec
[2022/11/09 11:01:31] ppocr INFO: Eval :
[2022/11/09 11:01:31] ppocr INFO:     dataset :
[2022/11/09 11:01:31] ppocr INFO:         data_dir : /paddle/data/data_lmdb_release/evaluation/
[2022/11/09 11:01:31] ppocr INFO:         name : LMDBDataSet
[2022/11/09 11:01:31] ppocr INFO:         transforms :
[2022/11/09 11:01:31] ppocr INFO:             DecodeImage :
[2022/11/09 11:01:31] ppocr INFO:                 channel_first : False
[2022/11/09 11:01:31] ppocr INFO:                 img_mode : BGR
[2022/11/09 11:01:31] ppocr INFO:             CTCLabelEncode : None
[2022/11/09 11:01:31] ppocr INFO:             SVTRRecResizeImg :
[2022/11/09 11:01:31] ppocr INFO:                 image_shape : [3, 64, 256]
[2022/11/09 11:01:31] ppocr INFO:                 padding : False
[2022/11/09 11:01:31] ppocr INFO:             KeepKeys :
[2022/11/09 11:01:31] ppocr INFO:                 keep_keys : ['image', 'label', 'length']
[2022/11/09 11:01:31] ppocr INFO:     loader :
[2022/11/09 11:01:31] ppocr INFO:         batch_size_per_card : 256
[2022/11/09 11:01:31] ppocr INFO:         drop_last : False
[2022/11/09 11:01:31] ppocr INFO:         num_workers : 2
[2022/11/09 11:01:31] ppocr INFO:         shuffle : False
[2022/11/09 11:01:31] ppocr INFO: Global :
[2022/11/09 11:01:31] ppocr INFO:     cal_metric_during_train : True
[2022/11/09 11:01:31] ppocr INFO:     character_dict_path : None
[2022/11/09 11:01:31] ppocr INFO:     character_type : en
[2022/11/09 11:01:31] ppocr INFO:     checkpoints : None
[2022/11/09 11:01:31] ppocr INFO:     distributed : False
[2022/11/09 11:01:31] ppocr INFO:     epoch_num : 20
[2022/11/09 11:01:31] ppocr INFO:     eval_batch_step : [0, 2000]
[2022/11/09 11:01:31] ppocr INFO:     infer_img : doc/imgs_words_en/word_10.png
[2022/11/09 11:01:31] ppocr INFO:     infer_mode : False
[2022/11/09 11:01:31] ppocr INFO:     log_smooth_window : 20
[2022/11/09 11:01:31] ppocr INFO:     max_text_length : 25
[2022/11/09 11:01:31] ppocr INFO:     pretrained_model : None
[2022/11/09 11:01:31] ppocr INFO:     print_batch_step : 10
[2022/11/09 11:01:31] ppocr INFO:     save_epoch_step : 1
[2022/11/09 11:01:31] ppocr INFO:     save_inference_dir : None
[2022/11/09 11:01:31] ppocr INFO:     save_model_dir : ./output/rec/svtr/
[2022/11/09 11:01:31] ppocr INFO:     save_res_path : ./output/rec/predicts_svtr_tiny.txt
[2022/11/09 11:01:31] ppocr INFO:     use_gpu : True
[2022/11/09 11:01:31] ppocr INFO:     use_space_char : False
[2022/11/09 11:01:31] ppocr INFO:     use_visualdl : False
[2022/11/09 11:01:31] ppocr INFO: Loss :
[2022/11/09 11:01:31] ppocr INFO:     name : CTCLoss
[2022/11/09 11:01:31] ppocr INFO: Metric :
[2022/11/09 11:01:31] ppocr INFO:     main_indicator : acc
[2022/11/09 11:01:31] ppocr INFO:     name : RecMetric
[2022/11/09 11:01:31] ppocr INFO: Optimizer :
[2022/11/09 11:01:31] ppocr INFO:     beta1 : 0.9
[2022/11/09 11:01:31] ppocr INFO:     beta2 : 0.99
[2022/11/09 11:01:31] ppocr INFO:     epsilon : 8e-08
[2022/11/09 11:01:31] ppocr INFO:     lr :
[2022/11/09 11:01:31] ppocr INFO:         learning_rate : 0.0005
[2022/11/09 11:01:31] ppocr INFO:         name : Cosine
[2022/11/09 11:01:31] ppocr INFO:         warmup_epoch : 2
[2022/11/09 11:01:31] ppocr INFO:     name : AdamW
[2022/11/09 11:01:31] ppocr INFO:     no_weight_decay_name : norm pos_embed
[2022/11/09 11:01:31] ppocr INFO:     one_dim_param_no_weight_decay : True
[2022/11/09 11:01:31] ppocr INFO:     weight_decay : 0.05
[2022/11/09 11:01:31] ppocr INFO: PostProcess :
[2022/11/09 11:01:31] ppocr INFO:     name : CTCLabelDecode
[2022/11/09 11:01:31] ppocr INFO: Train :
[2022/11/09 11:01:31] ppocr INFO:     dataset :
[2022/11/09 11:01:31] ppocr INFO:         data_dir : /paddle/data/data_lmdb_release/training/
[2022/11/09 11:01:31] ppocr INFO:         name : LMDBDataSet
[2022/11/09 11:01:31] ppocr INFO:         transforms :
[2022/11/09 11:01:31] ppocr INFO:             DecodeImage :
[2022/11/09 11:01:31] ppocr INFO:                 channel_first : False
[2022/11/09 11:01:31] ppocr INFO:                 img_mode : BGR
[2022/11/09 11:01:31] ppocr INFO:             CTCLabelEncode : None
[2022/11/09 11:01:31] ppocr INFO:             SVTRRecResizeImg :
[2022/11/09 11:01:31] ppocr INFO:                 image_shape : [3, 64, 256]
[2022/11/09 11:01:31] ppocr INFO:                 padding : False
[2022/11/09 11:01:31] ppocr INFO:             KeepKeys :
[2022/11/09 11:01:31] ppocr INFO:                 keep_keys : ['image', 'label', 'length']
[2022/11/09 11:01:31] ppocr INFO:     loader :
[2022/11/09 11:01:31] ppocr INFO:         batch_size_per_card : 512
[2022/11/09 11:01:31] ppocr INFO:         drop_last : True
[2022/11/09 11:01:31] ppocr INFO:         num_workers : 4
[2022/11/09 11:01:31] ppocr INFO:         shuffle : True
[2022/11/09 11:01:31] ppocr INFO: profiler_options : None
[2022/11/09 11:01:31] ppocr INFO: train with paddle 2.3.2 and device Place(gpu:0)
[2022/11/09 11:01:31] ppocr INFO: Initialize indexs of datasets:/paddle/data/data_lmdb_release/training/
[2022/11/09 11:01:43] ppocr WARNING: The character_dict_path is None, model can only recognize number and lower letters
[2022/11/09 11:01:43] ppocr INFO: Initialize indexs of datasets:/paddle/data/data_lmdb_release/evaluation/
[2022/11/09 11:01:43] ppocr WARNING: The character_dict_path is None, model can only recognize number and lower letters
W1109 11:01:43.789620  1229 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.4, Runtime API Version: 10.2
W1109 11:01:43.793349  1229 gpu_resources.cc:91] device: 0, cuDNN Version: 7.6.
Traceback (most recent call last):
  File "tools/train.py", line 208, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 121, in main
    model = build_model(config['Architecture'])
  File "/paddle/PaddleOCR/ppocr/modeling/architectures/__init__.py", line 30, in build_model
    arch = BaseModel(config)
  File "/paddle/PaddleOCR/ppocr/modeling/architectures/base_model.py", line 46, in __init__
    self.transform = build_transform(config['Transform'])
  File "/paddle/PaddleOCR/ppocr/modeling/transforms/__init__.py", line 30, in build_transform
    module_class = eval(module_name)(**config)
  File "/paddle/PaddleOCR/ppocr/modeling/transforms/stn.py", line 123, in __init__
    margins=tuple(tps_margins))
  File "/paddle/PaddleOCR/ppocr/modeling/transforms/tps_spatial_transformer.py", line 106, in __init__
    inverse_kernel = paddle.inverse(forward_kernel)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/tensor/math.py", line 1696, in inverse
    return _C_ops.inverse(x)
RuntimeError: (PreconditionNotMet) For batch [0]: U(1, 1) is zero, singular U. Please check the matrix value and change it to a non-singular matrix
  [Hint: Expected info[i] == 0, but received info[i]:1 != 0:0.] (at /paddle/paddle/phi/kernels/funcs/matrix_inverse.cu.cc:125)
  [operator < inverse > error]

willpat1213 • Nov 10 '22 02:11
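
A note on reading the error: the hint `info[i]:1` appears to follow the LAPACK-style convention for LU factorization, where a positive `info = i` means the pivot `U(i, i)` is exactly zero. The sketch below is a pure-Python illustration of that convention, not Paddle's actual kernel; the function name `lu_info` is made up for this example.

```python
def lu_info(matrix, tol=0.0):
    """Return 0 if LU with partial pivoting finds no zero pivot, otherwise
    the 1-based index i of the first zero pivot U(i, i)."""
    a = [list(map(float, row)) for row in matrix]  # work on a copy
    n = len(a)
    for k in range(n):
        # Partial pivoting: pick the row with the largest |entry| in column k.
        pivot_row = max(range(k, n), key=lambda r: abs(a[r][k]))
        if abs(a[pivot_row][k]) <= tol:
            return k + 1  # U(k+1, k+1) is zero, so the matrix is singular
        a[k], a[pivot_row] = a[pivot_row], a[k]
        # Eliminate column k from the rows below the pivot.
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= factor * a[k][c]
    return 0
```

With partial pivoting, a result of 1 means every entry in the first column is zero. An all-zero matrix gives exactly that result, so one possible reading of the error is that the GPU handed back zeroed or invalid data rather than the intended kernel matrix.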

We could not reproduce this issue. Did you modify the code?

Topdu • Nov 13 '22 13:11

After setting a breakpoint here (image), the values printed by the two methods are different: image

willpat1213 • Nov 13 '22 16:11

I have not modified any code. I suspect the problem is related to compatibility between Paddle and the driver, GPU, or Docker image.

willpat1213 • Nov 14 '22 02:11

Please check whether training runs normally without the GPU.

Topdu • Nov 14 '22 02:11

With use_gpu turned off, training runs normally.

willpat1213 • Nov 14 '22 04:11
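
The CPU-versus-GPU check discussed above can be narrowed down to the failing operation. This is a minimal sketch, assuming a working Paddle install with GPU support: it runs `paddle.inverse` on the same matrix on both devices so the results can be compared. The helper names and the test matrix are hypothetical, and the matrix is not the TPS kernel from the traceback.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two nested lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def inverse_on(device, matrix):
    """Invert `matrix` with paddle.inverse on `device` ('cpu' or 'gpu:0')."""
    import paddle  # imported lazily so max_abs_diff works without Paddle

    paddle.set_device(device)
    x = paddle.to_tensor(matrix, dtype="float32")
    return paddle.inverse(x).numpy().tolist()


# A small, well-conditioned example matrix (an assumption for illustration).
EXAMPLE = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
```

Calling `max_abs_diff(inverse_on("cpu", EXAMPLE), inverse_on("gpu:0", EXAMPLE))` should give a value near zero on a healthy setup. A large difference, or a `RuntimeError` from the GPU call only, would point at the GPU stack rather than at PaddleOCR.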

Switched to 2.1.3-gpu-cuda11.2-cudnn8 and the problem was solved.

willpat1213 • Nov 14 '22 10:11
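
One possible explanation for why the image change helped, not confirmed by the maintainers in this thread: the startup warning reports GPU Compute Capability 8.0 (A100) together with Runtime API Version 10.2, and Ampere GPUs need CUDA 11.0 or later. The sketch below parses that warning line and checks it against a small, partial table of minimum CUDA versions. The function name `runtime_supports_gpu` and the table are illustrative and not part of Paddle.

```python
import re

# Minimum CUDA runtime per GPU compute capability (partial, illustrative table).
MIN_CUDA = {(7, 0): (9, 0), (7, 5): (10, 0), (8, 0): (11, 0), (8, 6): (11, 1)}

PATTERN = re.compile(
    r"GPU Compute Capability: (\d+)\.(\d+), "
    r"Driver API Version: (\d+)\.(\d+), "
    r"Runtime API Version: (\d+)\.(\d+)"
)


def runtime_supports_gpu(log_line):
    """Return True/False if the CUDA runtime in the log line is new enough
    for the reported GPU, or None if the line or GPU is not recognized."""
    match = PATTERN.search(log_line)
    if match is None:
        return None
    cc_major, cc_minor, _, _, rt_major, rt_minor = map(int, match.groups())
    required = MIN_CUDA.get((cc_major, cc_minor))
    if required is None:
        return None  # compute capability not covered by the partial table
    return (rt_major, rt_minor) >= required
```

Applied to the warning line in the log above, the function returns `False`, which is consistent with the CUDA 10.2 image failing on the A100 and a CUDA 11.2 image working.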

Switched to 2.1.3-gpu-cuda11.2-cudnn8 and the problem was solved.

That does not work for me.

aishoot • Mar 15 '23 16:03