PaddleOCR training dataset

Please provide the following complete information so the problem can be located quickly

System Environment: Linux
Version: Paddle: 2.6 PaddleOCR:
Related components:
Command Code: python3 tools/train.py -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml
Complete Error Message:
[2022/09/13 17:21:04] ppocr INFO: train dataloader has 19 iters
[2022/09/13 17:21:04] ppocr INFO: valid dataloader has 7 iters
[2022/09/13 17:21:04] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 19 iterations
Corrupt JPEG data: bad Huffman code
Corrupt JPEG data: 18 extraneous bytes before marker 0xc4

Sep 13 '22 09:09 bank010

Some of the data is corrupt. If the proportion is small, it does not affect training.
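To get a rough idea of what proportion of the images is affected, a check along the following lines could be run over the image directory. This is only a minimal sketch, not part of PaddleOCR: the helper names `looks_complete` and `scan_image_dir` are hypothetical, and the directory path is the one that appears in the training config logged in this thread. The check is coarse. It only verifies that each file begins with the JPEG start-of-image marker and ends with the end-of-image marker, so it can catch truncated or non-JPEG files, but it would not catch damage inside the compressed data (such as the "bad Huffman code" warning), which needs a full decode with an image library.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_complete(path):
    """Coarse check: the file starts with SOI and ends with EOI."""
    data = Path(path).read_bytes()
    return (
        len(data) >= 4
        and data.startswith(JPEG_SOI)
        and data.endswith(JPEG_EOI)
    )


def scan_image_dir(image_dir):
    """Return (suspect_paths, total_count) for .jpg/.jpeg files in image_dir."""
    files = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg"}
    )
    suspects = [p for p in files if not looks_complete(p)]
    return suspects, len(files)


if __name__ == "__main__":
    # Assumed location, taken from the config printed in the logs; adjust as needed.
    image_dir = Path("train_data/XFUND/zh_train/image")
    if image_dir.is_dir():
        suspects, total = scan_image_dir(image_dir)
        print(f"{len(suspects)} of {total} JPEG files look incomplete")
        for p in suspects:
            print(p)
    else:
        print(f"directory not found: {image_dir}")
```

A file flagged here is a candidate for re-downloading or removing from the label file. A file that passes may still trigger libjpeg warnings during decoding.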

Sep 13 '22 12:09 MissPenguin

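"Some of the data is corrupt. If the proportion is small, it does not affect training."

Does "data" here refer to the dataset or to the configuration file?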

Sep 14 '22 01:09 bank010

I ran into the same problem on my side: the training process is terminated outright, and neither SER nor RE can be trained normally.

[2022/09/22 16:16:41] ppocr INFO: Architecture :
[2022/09/22 16:16:41] ppocr INFO: Backbone :
[2022/09/22 16:16:41] ppocr INFO: checkpoints : None
[2022/09/22 16:16:41] ppocr INFO: mode : vi
[2022/09/22 16:16:41] ppocr INFO: name : LayoutXLMForSer
[2022/09/22 16:16:41] ppocr INFO: num_classes : 7
[2022/09/22 16:16:41] ppocr INFO: pretrained : True
[2022/09/22 16:16:41] ppocr INFO: Transform : None
[2022/09/22 16:16:41] ppocr INFO: algorithm : LayoutXLM
[2022/09/22 16:16:41] ppocr INFO: model_type : kie
[2022/09/22 16:16:41] ppocr INFO: Eval :
[2022/09/22 16:16:41] ppocr INFO: dataset :
[2022/09/22 16:16:41] ppocr INFO: data_dir : train_data/XFUND/zh_val/image
[2022/09/22 16:16:41] ppocr INFO: label_file_list : ['train_data/XFUND/zh_val/val.json']
[2022/09/22 16:16:41] ppocr INFO: name : SimpleDataSet
[2022/09/22 16:16:41] ppocr INFO: transforms :
[2022/09/22 16:16:41] ppocr INFO: DecodeImage :
[2022/09/22 16:16:41] ppocr INFO: channel_first : False
[2022/09/22 16:16:41] ppocr INFO: img_mode : RGB
[2022/09/22 16:16:41] ppocr INFO: VQATokenLabelEncode :
[2022/09/22 16:16:41] ppocr INFO: algorithm : LayoutXLM
[2022/09/22 16:16:41] ppocr INFO: class_path : train_data/XFUND/class_list_xfun.txt
[2022/09/22 16:16:41] ppocr INFO: contains_re : False
[2022/09/22 16:16:41] ppocr INFO: order_method : tb-yx
[2022/09/22 16:16:41] ppocr INFO: use_textline_bbox_info : True
[2022/09/22 16:16:41] ppocr INFO: VQATokenPad :
[2022/09/22 16:16:41] ppocr INFO: max_seq_len : 512
[2022/09/22 16:16:41] ppocr INFO: return_attention_mask : True
[2022/09/22 16:16:41] ppocr INFO: VQASerTokenChunk :
[2022/09/22 16:16:41] ppocr INFO: max_seq_len : 512
[2022/09/22 16:16:41] ppocr INFO: Resize :
[2022/09/22 16:16:41] ppocr INFO: size : [224, 224]
[2022/09/22 16:16:41] ppocr INFO: NormalizeImage :
[2022/09/22 16:16:41] ppocr INFO: mean : [123.675, 116.28, 103.53]
[2022/09/22 16:16:41] ppocr INFO: order : hwc
[2022/09/22 16:16:41] ppocr INFO: scale : 1
[2022/09/22 16:16:41] ppocr INFO: std : [58.395, 57.12, 57.375]
[2022/09/22 16:16:41] ppocr INFO: ToCHWImage : None
[2022/09/22 16:16:41] ppocr INFO: KeepKeys :
[2022/09/22 16:16:41] ppocr INFO: keep_keys : ['input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
[2022/09/22 16:16:41] ppocr INFO: loader :
[2022/09/22 16:16:41] ppocr INFO: batch_size_per_card : 8
[2022/09/22 16:16:41] ppocr INFO: drop_last : False
[2022/09/22 16:16:41] ppocr INFO: num_workers : 4
[2022/09/22 16:16:41] ppocr INFO: shuffle : False
[2022/09/22 16:16:41] ppocr INFO: Global :
[2022/09/22 16:16:41] ppocr INFO: cal_metric_during_train : False
[2022/09/22 16:16:41] ppocr INFO: distributed : False
[2022/09/22 16:16:41] ppocr INFO: epoch_num : 200
[2022/09/22 16:16:41] ppocr INFO: eval_batch_step : [0, 19]
[2022/09/22 16:16:41] ppocr INFO: infer_img : ppstructure/docs/kie/input/zh_val_42.jpg
[2022/09/22 16:16:41] ppocr INFO: kie_det_model_dir : None
[2022/09/22 16:16:41] ppocr INFO: kie_rec_model_dir : None
[2022/09/22 16:16:41] ppocr INFO: log_smooth_window : 10
[2022/09/22 16:16:41] ppocr INFO: print_batch_step : 10
[2022/09/22 16:16:41] ppocr INFO: save_epoch_step : 2000
[2022/09/22 16:16:41] ppocr INFO: save_inference_dir : None
[2022/09/22 16:16:41] ppocr INFO: save_model_dir : ./output/ser_vi_layoutxlm_xfund_zh
[2022/09/22 16:16:41] ppocr INFO: save_res_path : ./output/ser/xfund_zh/res
[2022/09/22 16:16:41] ppocr INFO: seed : 2022
[2022/09/22 16:16:41] ppocr INFO: use_gpu : True
[2022/09/22 16:16:41] ppocr INFO: use_visualdl : False
[2022/09/22 16:16:41] ppocr INFO: Loss :
[2022/09/22 16:16:41] ppocr INFO: key : backbone_out
[2022/09/22 16:16:41] ppocr INFO: name : VQASerTokenLayoutLMLoss
[2022/09/22 16:16:41] ppocr INFO: num_classes : 7
[2022/09/22 16:16:41] ppocr INFO: Metric :
[2022/09/22 16:16:41] ppocr INFO: main_indicator : hmean
[2022/09/22 16:16:41] ppocr INFO: name : VQASerTokenMetric
[2022/09/22 16:16:41] ppocr INFO: Optimizer :
[2022/09/22 16:16:41] ppocr INFO: beta1 : 0.9
[2022/09/22 16:16:41] ppocr INFO: beta2 : 0.999
[2022/09/22 16:16:41] ppocr INFO: lr :
[2022/09/22 16:16:41] ppocr INFO: epochs : 200
[2022/09/22 16:16:41] ppocr INFO: learning_rate : 5e-05
[2022/09/22 16:16:41] ppocr INFO: name : Linear
[2022/09/22 16:16:41] ppocr INFO: warmup_epoch : 2
[2022/09/22 16:16:41] ppocr INFO: name : AdamW
[2022/09/22 16:16:41] ppocr INFO: regularizer :
[2022/09/22 16:16:41] ppocr INFO: factor : 0.0
[2022/09/22 16:16:41] ppocr INFO: name : L2
[2022/09/22 16:16:41] ppocr INFO: PostProcess :
[2022/09/22 16:16:41] ppocr INFO: class_path : train_data/XFUND/class_list_xfun.txt
[2022/09/22 16:16:41] ppocr INFO: name : VQASerTokenLayoutLMPostProcess
[2022/09/22 16:16:41] ppocr INFO: Train :
[2022/09/22 16:16:41] ppocr INFO: dataset :
[2022/09/22 16:16:41] ppocr INFO: data_dir : train_data/XFUND/zh_train/image
[2022/09/22 16:16:41] ppocr INFO: label_file_list : ['train_data/XFUND/zh_train/train.json']
[2022/09/22 16:16:41] ppocr INFO: name : SimpleDataSet
[2022/09/22 16:16:41] ppocr INFO: ratio_list : [1.0]
[2022/09/22 16:16:41] ppocr INFO: transforms :
[2022/09/22 16:16:41] ppocr INFO: DecodeImage :
[2022/09/22 16:16:41] ppocr INFO: channel_first : False
[2022/09/22 16:16:41] ppocr INFO: img_mode : RGB
[2022/09/22 16:16:41] ppocr INFO: VQATokenLabelEncode :
[2022/09/22 16:16:41] ppocr INFO: algorithm : LayoutXLM
[2022/09/22 16:16:41] ppocr INFO: class_path : train_data/XFUND/class_list_xfun.txt
[2022/09/22 16:16:41] ppocr INFO: contains_re : False
[2022/09/22 16:16:41] ppocr INFO: order_method : tb-yx
[2022/09/22 16:16:41] ppocr INFO: use_textline_bbox_info : True
[2022/09/22 16:16:41] ppocr INFO: VQATokenPad :
[2022/09/22 16:16:41] ppocr INFO: max_seq_len : 512
[2022/09/22 16:16:41] ppocr INFO: return_attention_mask : True
[2022/09/22 16:16:41] ppocr INFO: VQASerTokenChunk :
[2022/09/22 16:16:41] ppocr INFO: max_seq_len : 512
[2022/09/22 16:16:41] ppocr INFO: Resize :
[2022/09/22 16:16:41] ppocr INFO: size : [224, 224]
[2022/09/22 16:16:41] ppocr INFO: NormalizeImage :
[2022/09/22 16:16:41] ppocr INFO: mean : [123.675, 116.28, 103.53]
[2022/09/22 16:16:41] ppocr INFO: order : hwc
[2022/09/22 16:16:41] ppocr INFO: scale : 1
[2022/09/22 16:16:41] ppocr INFO: std : [58.395, 57.12, 57.375]
[2022/09/22 16:16:41] ppocr INFO: ToCHWImage : None
[2022/09/22 16:16:41] ppocr INFO: KeepKeys :
[2022/09/22 16:16:41] ppocr INFO: keep_keys : ['input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
[2022/09/22 16:16:41] ppocr INFO: loader :
[2022/09/22 16:16:41] ppocr INFO: batch_size_per_card : 8
[2022/09/22 16:16:41] ppocr INFO: drop_last : False
[2022/09/22 16:16:41] ppocr INFO: num_workers : 4
[2022/09/22 16:16:41] ppocr INFO: shuffle : True
[2022/09/22 16:16:41] ppocr INFO: profiler_options : None
[2022/09/22 16:16:41] ppocr INFO: train with paddle 2.3.2 and device Place(gpu:0)
[2022/09/22 16:16:41] ppocr INFO: Initialize indexs of datasets:['train_data/XFUND/zh_train/train.json']
[2022-09-22 16:16:42,232] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/layoutxlm-base-uncased/sentencepiece.bpe.model
[2022-09-22 16:16:42,988] [ INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/layoutxlm-base-uncased/tokenizer_config.json
[2022-09-22 16:16:42,988] [ INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/layoutxlm-base-uncased/special_tokens_map.json
[2022/09/22 16:16:42] ppocr INFO: Initialize indexs of datasets:['train_data/XFUND/zh_val/val.json']
[2022-09-22 16:16:42,990] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/layoutxlm-base-uncased/sentencepiece.bpe.model
[2022-09-22 16:16:43,720] [ INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/layoutxlm-base-uncased/tokenizer_config.json
[2022-09-22 16:16:43,720] [ INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/layoutxlm-base-uncased/special_tokens_map.json
[2022-09-22 16:16:43,723] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/vi-layoutxlm-base-uncased/model_state.pdparams
W0922 16:16:43.724370 21166 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0922 16:16:43.728435 21166 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022/09/22 16:16:48] ppocr INFO: train dataloader has 19 iters
[2022/09/22 16:16:48] ppocr INFO: valid dataloader has 7 iters
[2022/09/22 16:16:48] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 19 iterations
Corrupt JPEG data: 18 extraneous bytes before marker 0xc4

Environment: AI Studio, V100, paddlenlp 2.3.0.dev0, paddleocr 2.6, paddlepaddle-gpu 2.3.2.post112

Sep 22 '22 08:09 Nathan-Wang19

I have encountered the same problem. What is the solution for this?

Feb 02 '23 03:02 TasneemVKhan
