PaddleOCR
SVTR recognition model: inference output is inconsistent with the results before export
The SVTR recognition model gives very good results on the dataset before it is converted to an inference model, but after converting it with export_model and running predict_rec.py for recognition, the results are much worse, and I don't know why. The Paddle version is 2.4, CUDA is 11.3, and cuDNN is 8.2.
P.S. Adding --use_tensorrt=True to the predict_rec command, i.e. using TensorRT to accelerate inference, makes the model output even stranger and mostly wrong, whereas other models such as the PP-OCRv3 recognition model work fine.
Below is the SVTR config file:
Global:
  use_gpu: True
  epoch_num: 5000
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/rec/rec_svtr_large_en/
  save_epoch_step: 10
  eval_batch_step: [20000, 2000]
  cal_metric_during_train: True
  pretrained_model: ./pretrain_models/rec_svtr_large_none_ctc_en_train/best_accuracy
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/imgs_words_en/word_10.png
  character_dict_path: ./ppocr/utils/my_dict.txt
  character_type: en
  max_text_length: 25
  infer_mode: False
  use_space_char: False
  save_res_path: ./output/rec/predicts_svtr_large.txt
Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.99
  epsilon: 0.00000008
  weight_decay: 0.05
  no_weight_decay_name: norm pos_embed
  one_dim_param_no_weight_decay: true
  lr:
    name: Cosine
    learning_rate: 0.000065
    warmup_epoch: 100
Architecture:
  model_type: rec
  algorithm: SVTR
  Transform:
    name: STN_ON
    tps_inputsize: [32, 64]
    tps_outputsize: [48, 160]
    num_control_points: 20
    tps_margins: [0.05, 0.05]
    stn_activation: none
  Backbone:
    name: SVTRNet
    img_size: [48, 160]
    out_char_num: 40
    out_channels: 384
    patch_merging: 'Conv'
    embed_dim: [192, 256, 512]
    depth: [3, 9, 9]
    num_heads: [6, 8, 16]
    mixer: ['Local','Local','Local','Local','Local','Local','Local','Local','Local','Local','Global','Global','Global','Global','Global','Global','Global','Global','Global','Global','Global']
    local_mixer: [[7, 11], [7, 11], [7, 11]]
    prenorm: false
  Neck:
    name: SequenceEncoder
    encoder_type: reshape
  Head:
    name: CTCHead
Loss:
  name: CTCLoss
PostProcess:
  name: CTCLabelDecode # SVTRLabelDecode is used for eval after train, please change to CTCLabelDecode when training
Metric:
  name: RecMetric
  main_indicator: acc
Train:
  dataset:
    name: LMDBDataSet
    data_dir: ./OCR_dataset/lmdb/train
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - RecAug:
      - CTCLabelEncode: # Class handling label
      - RecResizeImg:
          character_dict_path:
          image_shape: [3, 64, 256]
          padding: False
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: True
    batch_size_per_card: 64
    drop_last: True
    num_workers: 4
Eval:
  dataset:
    name: LMDBDataSet
    data_dir: ./OCR_dataset_test_lmdb
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - CTCLabelEncode: # Class handling label
      - SVTRRecResizeImg: # SVTRRecResizeImg is used for eval after train, please change to RecResizeImg when training
          character_dict_path:
          image_shape: [3, 64, 256]
          padding: False
      - KeepKeys:
          keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 64
    num_workers: 2
The command used is:
python tools/infer/predict_rec.py --image_dir="./OCR_dataset_test/crop_img/" --rec_model_dir="./inference/rec_svtr_large_stn_en/" --use_angle_cls=false --rec_char_dict_path="./ppocr/utils/my_dict.txt" --rec_image_shape="3,64,256"
I'm experiencing this exact problem as well, using the docker container paddlepaddle/paddle:2.3.2-gpu-cuda11.2-cudnn8.
The output of tools/infer_rec.py when running my trained model on a test set is almost perfect.
The output of tools/infer/predict_rec.py when running the exported inference model on the same test set is slightly off (it outputs wrong but similar-looking letters, and randomly repeats a character at the start or end of a string). Further testing confirms it reaches only about half the accuracy of the trained model.
The input image size is set the same for both scripts, as is the character dict.
@dual19 Yeah, I have the same problem as you. Do you use the SVTR rec algorithm? And do you use TensorRT to speed up inference? I find that it gives worse results.
Specifying the recognition algorithm at prediction time, i.e. adding --rec_algorithm="SVTR", makes the inference results match the earlier ones.
However, inference with TensorRT still has problems and the output is very strange. Here is my command:
python tools/infer/predict_rec.py --image_dir="./OCR_dataset_test/crop_img/" --rec_model_dir="./inference/rec_svtr_large_stn_en/" --use_angle_cls=false --rec_char_dict_path="./ppocr/utils/tgb_dict.txt" --rec_image_shape="3,64,256" --rec_algorithm='SVTR' --use_tensorrt=True
The TensorRT version is 8.0.3.4, and Paddle Inference was downloaded and installed from the following link: https://paddle-inference-lib.bj.bcebos.com/2.4.0-rc0/python/Linux/GPU/x86-64_gcc8.2_avx_mkl_cuda11.2_cudnn8.2.1_trt8.0.3.4/paddlepaddle_gpu-2.4.0rc0.post112-cp37-cp37m-linux_x86_64.whl
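For reference, --use_tensorrt=True makes predict_rec.py turn on the TensorRT subgraph engine on the Paddle Inference config, roughly along the lines of the sketch below (simplified, with example values; not the exact tools/infer/utility.py code):

# Simplified sketch of what --use_tensorrt=True enables (illustrative values,
# not the exact PaddleOCR utility code).
from paddle import inference

config = inference.Config(
    "./inference/rec_svtr_large_stn_en/inference.pdmodel",
    "./inference/rec_svtr_large_stn_en/inference.pdiparams")
config.enable_use_gpu(500, 0)            # initial GPU memory pool (MB), device id
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=6,                    # example value
    min_subgraph_size=15,                # smaller subgraphs stay on native Paddle
    precision_mode=inference.PrecisionType.Float32,
    use_static=False,
    use_calib_mode=False)
predictor = inference.create_predictor(config)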
So far the detection model and the PP-OCRv3 recognition model test fine; only SVTR has problems with TensorRT. How can this be solved?
Could you please provide a test case and its prediction results? That would make it easier to analyze.
@dual19 I face the same issue when converting my trained PPOCRv3 model to an inference model. Accuracy drops for some reason when using the exported model.
Hi, I actually face the same issue too. I trained an SVTR model and used tools/infer_rec.py and tools/infer/predict_rec.py to test the same dataset, and the latter shows a huge accuracy decrease. I debugged the code and found that the resize_norm_img_svtr function in predict_rec.py differs from the resize_norm_img function I used while training: the former resizes the image directly to 3, 32, 320, while the latter keeps the width/height ratio and uses padding to bring the image to 3, 32, 320. When I change this function, the performance of the two methods is the same. I hope this helps you~
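To make the difference concrete, here is a minimal sketch of the two behaviours described above; the helper names, normalization constants, and the 3, 32, 320 shape are illustrative, not the exact PaddleOCR implementations:

import cv2
import numpy as np

def resize_direct(img, image_shape=(3, 32, 320)):
    # Stretch the crop to the target size, ignoring its aspect ratio
    # (what resize_norm_img_svtr effectively does).
    c, h, w = image_shape
    resized = cv2.resize(img, (w, h)).astype("float32")
    resized = resized.transpose((2, 0, 1)) / 255.0
    return (resized - 0.5) / 0.5

def resize_keep_ratio_pad(img, image_shape=(3, 32, 320)):
    # Keep the width/height ratio and right-pad with zeros to the target
    # width (what the training-time resize_norm_img does).
    c, h, w = image_shape
    ratio = img.shape[1] / float(img.shape[0])
    resized_w = min(w, int(np.ceil(h * ratio)))
    resized = cv2.resize(img, (resized_w, h)).astype("float32")
    resized = resized.transpose((2, 0, 1)) / 255.0
    resized = (resized - 0.5) / 0.5
    padded = np.zeros(image_shape, dtype="float32")
    padded[:, :, :resized_w] = resized
    return padded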
@chengchenng Do you use tensorrt to speed up the rec algorithm? I got some bad results when using it.
@chengchenng I noticed a difference between the resize function from tools/infer_rec.py and the eval script.
I had to set infer_mode = False during the ops creation to get matching accuracy for the training checkpoint between those two scripts.
I will check out the change you recommended for the inference model to see if that resolves this issue.
And no, I haven't tested the TensorRT performance yet either.
Could you please provide a test case and its prediction results? That would make it easier to analyze.

Hi, normal SVTR inference works fine for me; the problems only appear after enabling TensorRT acceleration. The environment is as described above. Below are the test cases run with TensorRT:
Output: ('210', 0.22712290287017822)
Result without TensorRT acceleration: ('G39', 0.9999995231628418)
Output: ('918', 0.15482832491397858)
Result without TensorRT acceleration: ('G14', 0.9999997615814209)
It does look like something must be wrong somewhere... I haven't done the conversion myself yet, though. If I convert it later and it works fine, I'll let you know.
Does this happen for many cases? If all test cases behave like this, it may be a problem with the grid_sample operator in the rectification module; because it is hard to parallelize, this operator is not supported by many inference acceleration frameworks.
Yes, it's basically always like this; the confidence of the model output never exceeds 0.25.
When exporting the model, you can first remove the rectification module (remember to adjust input_size accordingly). If the predictions then come out normal, grid_sample is the cause.
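A rough sketch of one way to run that check is below; the file names and the substitute input size are assumptions for illustration, not values from this thread:

# Drop the STN rectification module (the source of grid_sample) from the
# config, then re-export and test the prediction again.
import yaml

with open("rec_svtr_large_stn_en.yml") as f:              # original training config (assumed name)
    cfg = yaml.safe_load(f)

cfg["Architecture"]["Transform"] = None                    # remove STN_ON, i.e. grid_sample
cfg["Architecture"]["Backbone"]["img_size"] = [64, 256]    # must now match the raw input size (assumption)

with open("rec_svtr_large_no_stn.yml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)

# Then re-export, e.g.:
# python tools/export_model.py -c rec_svtr_large_no_stn.yml \
#     -o Global.pretrained_model=<trained weights> Global.save_inference_dir=<output dir>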
@Topdu Could you please share some example code for this?
Global:
use_gpu: True
epoch_num: 20
log_smooth_window: 20
print_batch_step: 10
save_model_dir: ./output/rec/svtr/
save_epoch_step: 1
# evaluation is run every 2000 iterations after the 0th iteration
eval_batch_step: [0, 2000]
cal_metric_during_train: True
pretrained_model:
checkpoints:
save_inference_dir:
use_visualdl: False
infer_img: doc/imgs_words_en/word_10.png
# for data or label process
character_dict_path:
character_type: en
max_text_length: 25
infer_mode: False
use_space_char: False
save_res_path: ./output/rec/predicts_svtr_tiny.txt
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.99
epsilon: 8.e-8
weight_decay: 0.05
no_weight_decay_name: norm pos_embed
one_dim_param_no_weight_decay: true
lr:
name: Cosine
learning_rate: 0.0005
warmup_epoch: 2
Architecture:
model_type: rec
algorithm: SVTR
Transform:
Backbone:
name: SVTRNet
img_size: [32, 100]
out_char_num: 25
out_channels: 192
patch_merging: 'Conv'
embed_dim: [64, 128, 256]
depth: [3, 6, 3]
num_heads: [2, 4, 8]
mixer: ['Local','Local','Local','Local','Local','Local','Global','Global','Global','Global','Global','Global']
local_mixer: [[7, 11], [7, 11], [7, 11]]
last_stage: True
prenorm: false
Neck:
name: SequenceEncoder
encoder_type: reshape
Head:
name: CTCHead
Loss:
name: CTCLoss
PostProcess:
name: CTCLabelDecode
Metric:
name: RecMetric
main_indicator: acc
Train:
dataset:
name: LMDBDataSet
data_dir: ./train_data/data_lmdb_release/training/
transforms:
- DecodeImage: # load image
img_mode: BGR
channel_first: False
- CTCLabelEncode: # Class handling label
- SVTRRecResizeImg:
image_shape: [3, 32, 100]
padding: False
- KeepKeys:
keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
loader:
shuffle: True
batch_size_per_card: 512
drop_last: True
num_workers: 4
Eval:
dataset:
name: LMDBDataSet
data_dir: ./train_data/data_lmdb_release/evaluation/
transforms:
- DecodeImage: # load image
img_mode: BGR
channel_first: False
- CTCLabelEncode: # Class handling label
- SVTRRecResizeImg:
image_shape: [3, 32, 100]
padding: False
- KeepKeys:
keep_keys: ['image', 'label', 'length'] # dataloader will return list in this order
loader:
shuffle: False
drop_last: False
batch_size_per_card: 256
num_workers: 2
To follow up on @chengchenng's comment, I've noticed two issues with the current ppocr code.
- The __call__ method in predict_rec.py checks whether the algorithm matches "SVTR" exactly to decide which resizing function to use. My rec_algorithm was set to "SVTR_LCNet", so changing this line to
elif "SVTR" in self.rec_algorithm
should resolve this issue.
- After resolving the previous issue, accuracy improved but was still 75%, compared to the 87% I was seeing with the eval.py script. To match eval.py, I had to use this resizing function instead. After copying it in place of the resize_norm_img_svtr function, accuracy went up to 87%.
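A tiny illustrative sketch of the dispatch change (names follow this thread, not necessarily the exact upstream code): any algorithm whose name contains "SVTR" is routed to the SVTR resize path, which can then be the ratio-preserving resize sketched earlier.

def choose_resize(rec_algorithm: str) -> str:
    # Match "SVTR" as a substring so that variants like "SVTR_LCNet"
    # also take the SVTR resize path.
    if "SVTR" in rec_algorithm:
        return "resize_keep_ratio_pad"   # hypothetical padding-based resize from above
    return "resize_norm_img"

print(choose_resize("SVTR_LCNet"))       # -> resize_keep_ratio_pad
print(choose_resize("CRNN"))             # -> resize_norm_img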
@Topdu Would you like me to create an MR for these changes? Also, why is the choice of resizing function based on infer_mode, when it appears that one of the functions is better suited for Chinese characters?