PaddleOCR: Out of GPU memory when training KIE distillation

Please provide the following information to quickly locate the problem

System Environment: Windows 11
Version: Paddle: 2.3.2, PaddleOCR: 2.6
Related components: CUDA 10.2, cuDNN 8.4
Command Code: python tools/train.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml
Complete Error Message:
[2022/10/21 15:18:49] ppocr INFO: Architecture :
[2022/10/21 15:18:49] ppocr INFO: Models :
[2022/10/21 15:18:49] ppocr INFO: Student :
[2022/10/21 15:18:49] ppocr INFO: Backbone :
[2022/10/21 15:18:49] ppocr INFO: checkpoints : None
[2022/10/21 15:18:49] ppocr INFO: mode : vi
[2022/10/21 15:18:49] ppocr INFO: name : LayoutXLMForSer
[2022/10/21 15:18:49] ppocr INFO: num_classes : 44
[2022/10/21 15:18:49] ppocr INFO: pretrained : True
[2022/10/21 15:18:49] ppocr INFO: Transform : None
[2022/10/21 15:18:49] ppocr INFO: algorithm : LayoutXLM
[2022/10/21 15:18:49] ppocr INFO: freeze_params : False
[2022/10/21 15:18:49] ppocr INFO: model_type : kie
[2022/10/21 15:18:49] ppocr INFO: pretrained : None
[2022/10/21 15:18:49] ppocr INFO: return_all_feats : True
[2022/10/21 15:18:49] ppocr INFO: Teacher :
[2022/10/21 15:18:49] ppocr INFO: Backbone :
[2022/10/21 15:18:49] ppocr INFO: checkpoints : None
[2022/10/21 15:18:49] ppocr INFO: mode : vi
[2022/10/21 15:18:49] ppocr INFO: name : LayoutXLMForSer
[2022/10/21 15:18:49] ppocr INFO: num_classes : 44
[2022/10/21 15:18:49] ppocr INFO: pretrained : True
[2022/10/21 15:18:49] ppocr INFO: Transform : None
[2022/10/21 15:18:49] ppocr INFO: algorithm : LayoutXLM
[2022/10/21 15:18:49] ppocr INFO: freeze_params : False
[2022/10/21 15:18:49] ppocr INFO: model_type : kie
[2022/10/21 15:18:49] ppocr INFO: pretrained : None
[2022/10/21 15:18:49] ppocr INFO: return_all_feats : True
[2022/10/21 15:18:49] ppocr INFO: algorithm : Distillation
[2022/10/21 15:18:49] ppocr INFO: model_type : kie
[2022/10/21 15:18:49] ppocr INFO: name : DistillationModel
[2022/10/21 15:18:49] ppocr INFO: Eval :
[2022/10/21 15:18:49] ppocr INFO: dataset :
[2022/10/21 15:18:49] ppocr INFO: data_dir : train_data/det/val
[2022/10/21 15:18:49] ppocr INFO: label_file_list : ['train_data/det/val_kie.txt']
[2022/10/21 15:18:49] ppocr INFO: name : SimpleDataSet
[2022/10/21 15:18:49] ppocr INFO: transforms :
[2022/10/21 15:18:49] ppocr INFO: DecodeImage :
[2022/10/21 15:18:49] ppocr INFO: channel_first : False
[2022/10/21 15:18:49] ppocr INFO: img_mode : RGB
[2022/10/21 15:18:49] ppocr INFO: VQATokenLabelEncode :
[2022/10/21 15:18:49] ppocr INFO: algorithm : LayoutXLM
[2022/10/21 15:18:49] ppocr INFO: class_path : train_data/det/kie_dict.txt
[2022/10/21 15:18:49] ppocr INFO: contains_re : False
[2022/10/21 15:18:49] ppocr INFO: order_method : tb-yx
[2022/10/21 15:18:49] ppocr INFO: VQATokenPad :
[2022/10/21 15:18:49] ppocr INFO: max_seq_len : 512
[2022/10/21 15:18:49] ppocr INFO: return_attention_mask : True
[2022/10/21 15:18:49] ppocr INFO: VQASerTokenChunk :
[2022/10/21 15:18:49] ppocr INFO: max_seq_len : 512
[2022/10/21 15:18:49] ppocr INFO: Resize :
[2022/10/21 15:18:49] ppocr INFO: size : [224, 224]
[2022/10/21 15:18:49] ppocr INFO: NormalizeImage :
[2022/10/21 15:18:49] ppocr INFO: mean : [123.675, 116.28, 103.53]
[2022/10/21 15:18:49] ppocr INFO: order : hwc
[2022/10/21 15:18:49] ppocr INFO: scale : 1
[2022/10/21 15:18:49] ppocr INFO: std : [58.395, 57.12, 57.375]
[2022/10/21 15:18:49] ppocr INFO: ToCHWImage : None
[2022/10/21 15:18:49] ppocr INFO: KeepKeys :
[2022/10/21 15:18:49] ppocr INFO: keep_keys : ['input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
[2022/10/21 15:18:49] ppocr INFO: loader :
[2022/10/21 15:18:49] ppocr INFO: batch_size_per_card : 1
[2022/10/21 15:18:49] ppocr INFO: drop_last : False
[2022/10/21 15:18:49] ppocr INFO: num_workers : 0
[2022/10/21 15:18:49] ppocr INFO: shuffle : False
[2022/10/21 15:18:49] ppocr INFO: Global :
[2022/10/21 15:18:49] ppocr INFO: cal_metric_during_train : False
[2022/10/21 15:18:49] ppocr INFO: distributed : False
[2022/10/21 15:18:49] ppocr INFO: epoch_num : 200
[2022/10/21 15:18:49] ppocr INFO: eval_batch_step : [0, 10]
[2022/10/21 15:18:49] ppocr INFO: infer_img : ppstructure/docs/kie/input/zh_val_42.jpg
[2022/10/21 15:18:49] ppocr INFO: log_smooth_window : 10
[2022/10/21 15:18:49] ppocr INFO: print_batch_step : 10
[2022/10/21 15:18:49] ppocr INFO: save_epoch_step : 2000
[2022/10/21 15:18:49] ppocr INFO: save_inference_dir : None
[2022/10/21 15:18:49] ppocr INFO: save_model_dir : ./output/ser_vi_layoutxlm_xfund_zh_udml
[2022/10/21 15:18:49] ppocr INFO: save_res_path : ./output/ser_layoutxlm_xfund_zh/res
[2022/10/21 15:18:49] ppocr INFO: seed : 2022
[2022/10/21 15:18:49] ppocr INFO: use_gpu : True
[2022/10/21 15:18:49] ppocr INFO: use_visualdl : True
[2022/10/21 15:18:49] ppocr INFO: Loss :
[2022/10/21 15:18:49] ppocr INFO: loss_config_list :
[2022/10/21 15:18:49] ppocr INFO: DistillationVQASerTokenLayoutLMLoss :
[2022/10/21 15:18:49] ppocr INFO: key : backbone_out
[2022/10/21 15:18:49] ppocr INFO: model_name_list : ['Student', 'Teacher']
[2022/10/21 15:18:49] ppocr INFO: num_classes : 44
[2022/10/21 15:18:49] ppocr INFO: weight : 1.0
[2022/10/21 15:18:49] ppocr INFO: DistillationSERDMLLoss :
[2022/10/21 15:18:49] ppocr INFO: act : softmax
[2022/10/21 15:18:49] ppocr INFO: key : backbone_out
[2022/10/21 15:18:49] ppocr INFO: model_name_pairs : [['Student', 'Teacher']]
[2022/10/21 15:18:49] ppocr INFO: use_log : True
[2022/10/21 15:18:49] ppocr INFO: weight : 1.0
[2022/10/21 15:18:49] ppocr INFO: DistillationVQADistanceLoss :
[2022/10/21 15:18:49] ppocr INFO: key : hidden_states_5
[2022/10/21 15:18:49] ppocr INFO: mode : l2
[2022/10/21 15:18:49] ppocr INFO: model_name_pairs : [['Student', 'Teacher']]
[2022/10/21 15:18:49] ppocr INFO: name : loss_5
[2022/10/21 15:18:49] ppocr INFO: weight : 0.5
[2022/10/21 15:18:49] ppocr INFO: DistillationVQADistanceLoss :
[2022/10/21 15:18:49] ppocr INFO: key : hidden_states_8
[2022/10/21 15:18:49] ppocr INFO: mode : l2
[2022/10/21 15:18:49] ppocr INFO: model_name_pairs : [['Student', 'Teacher']]
[2022/10/21 15:18:49] ppocr INFO: name : loss_8
[2022/10/21 15:18:49] ppocr INFO: weight : 0.5
[2022/10/21 15:18:49] ppocr INFO: name : CombinedLoss
[2022/10/21 15:18:49] ppocr INFO: Metric :
[2022/10/21 15:18:49] ppocr INFO: base_metric_name : VQASerTokenMetric
[2022/10/21 15:18:49] ppocr INFO: key : Student
[2022/10/21 15:18:49] ppocr INFO: main_indicator : hmean
[2022/10/21 15:18:49] ppocr INFO: name : DistillationMetric
[2022/10/21 15:18:49] ppocr INFO: Optimizer :
[2022/10/21 15:18:49] ppocr INFO: beta1 : 0.9
[2022/10/21 15:18:49] ppocr INFO: beta2 : 0.999
[2022/10/21 15:18:49] ppocr INFO: lr :
[2022/10/21 15:18:49] ppocr INFO: epochs : 200
[2022/10/21 15:18:49] ppocr INFO: learning_rate : 5e-05
[2022/10/21 15:18:49] ppocr INFO: name : Linear
[2022/10/21 15:18:49] ppocr INFO: warmup_epoch : 10
[2022/10/21 15:18:49] ppocr INFO: name : AdamW
[2022/10/21 15:18:49] ppocr INFO: regularizer :
[2022/10/21 15:18:49] ppocr INFO: factor : 0.0
[2022/10/21 15:18:49] ppocr INFO: name : L2
[2022/10/21 15:18:49] ppocr INFO: PostProcess :
[2022/10/21 15:18:49] ppocr INFO: class_path : train_data/det/kie_dict.txt
[2022/10/21 15:18:49] ppocr INFO: key : backbone_out
[2022/10/21 15:18:49] ppocr INFO: model_name : ['Student', 'Teacher']
[2022/10/21 15:18:49] ppocr INFO: name : DistillationSerPostProcess
[2022/10/21 15:18:49] ppocr INFO: Train :
[2022/10/21 15:18:49] ppocr INFO: dataset :
[2022/10/21 15:18:49] ppocr INFO: data_dir : train_data/det/train
[2022/10/21 15:18:49] ppocr INFO: label_file_list : ['train_data/det/train_kie.txt']
[2022/10/21 15:18:49] ppocr INFO: name : SimpleDataSet
[2022/10/21 15:18:49] ppocr INFO: ratio_list : [1.0]
[2022/10/21 15:18:49] ppocr INFO: transforms :
[2022/10/21 15:18:49] ppocr INFO: DecodeImage :
[2022/10/21 15:18:49] ppocr INFO: channel_first : False
[2022/10/21 15:18:49] ppocr INFO: img_mode : RGB
[2022/10/21 15:18:49] ppocr INFO: VQATokenLabelEncode :
[2022/10/21 15:18:49] ppocr INFO: algorithm : LayoutXLM
[2022/10/21 15:18:49] ppocr INFO: class_path : train_data/det/kie_dict.txt
[2022/10/21 15:18:49] ppocr INFO: contains_re : False
[2022/10/21 15:18:49] ppocr INFO: order_method : tb-yx
[2022/10/21 15:18:49] ppocr INFO: VQATokenPad :
[2022/10/21 15:18:49] ppocr INFO: max_seq_len : 512
[2022/10/21 15:18:49] ppocr INFO: return_attention_mask : True
[2022/10/21 15:18:49] ppocr INFO: VQASerTokenChunk :
[2022/10/21 15:18:49] ppocr INFO: max_seq_len : 512
[2022/10/21 15:18:49] ppocr INFO: Resize :
[2022/10/21 15:18:49] ppocr INFO: size : [224, 224]
[2022/10/21 15:18:49] ppocr INFO: NormalizeImage :
[2022/10/21 15:18:49] ppocr INFO: mean : [123.675, 116.28, 103.53]
[2022/10/21 15:18:49] ppocr INFO: order : hwc
[2022/10/21 15:18:49] ppocr INFO: scale : 1
[2022/10/21 15:18:49] ppocr INFO: std : [58.395, 57.12, 57.375]
[2022/10/21 15:18:49] ppocr INFO: ToCHWImage : None
[2022/10/21 15:18:49] ppocr INFO: KeepKeys :
[2022/10/21 15:18:49] ppocr INFO: keep_keys : ['input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
[2022/10/21 15:18:49] ppocr INFO: loader :
[2022/10/21 15:18:49] ppocr INFO: batch_size_per_card : 1
[2022/10/21 15:18:49] ppocr INFO: drop_last : False
[2022/10/21 15:18:49] ppocr INFO: num_workers : 0
[2022/10/21 15:18:49] ppocr INFO: shuffle : True
[2022/10/21 15:18:49] ppocr INFO: profiler_options : None
[2022/10/21 15:18:49] ppocr INFO: train with paddle 2.3.2 and device Place(gpu:0)
[2022/10/21 15:18:49] ppocr INFO: Initialize indexs of datasets:['train_data/det/train_kie.txt']
[2022-10-21 15:18:50,527] [ INFO] - Already cached C:\Users\Ruben.paddlenlp\models\layoutxlm-base-uncased\sentencepiece.bpe.model
[2022-10-21 15:18:50,958] [ INFO] - tokenizer config file saved in C:\Users\Ruben.paddlenlp\models\layoutxlm-base-uncased\tokenizer_config.json
[2022-10-21 15:18:50,959] [ INFO] - Special tokens file saved in C:\Users\Ruben.paddlenlp\models\layoutxlm-base-uncased\special_tokens_map.json
[2022/10/21 15:18:50] ppocr INFO: Initialize indexs of datasets:['train_data/det/val_kie.txt']
[2022-10-21 15:18:50,960] [ INFO] - Already cached C:\Users\Ruben.paddlenlp\models\layoutxlm-base-uncased\sentencepiece.bpe.model
[2022-10-21 15:18:51,367] [ INFO] - tokenizer config file saved in C:\Users\Ruben.paddlenlp\models\layoutxlm-base-uncased\tokenizer_config.json
[2022-10-21 15:18:51,368] [ INFO] - Special tokens file saved in C:\Users\Ruben.paddlenlp\models\layoutxlm-base-uncased\special_tokens_map.json
[2022-10-21 15:18:51,372] [ INFO] - Already cached C:\Users\Ruben.paddlenlp\models\vi-layoutxlm-base-uncased\model_state.pdparams
W1021 15:18:51.374509 21996 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 10.2
W1021 15:18:51.387521 21996 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
[2022-10-21 15:18:56,264] [ INFO] - Already cached C:\Users\Ruben.paddlenlp\models\vi-layoutxlm-base-uncased\model_state.pdparams
[2022/10/21 15:18:57] ppocr INFO: train dataloader has 72 iters
[2022/10/21 15:18:57] ppocr INFO: valid dataloader has 18 iters
[2022/10/21 15:18:57] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 10 iterations
Traceback (most recent call last):
  File "tools/train.py", line 202, in <module>
    main(config, device, logger, vdl_writer)
  File "tools/train.py", line 177, in main
    eval_class, pre_best_model_dict, logger, vdl_writer, scaler,amp_level, amp_custom_black_list)
  File "C:\Work\CENATAV\Libraries\PaddleOCR\tools\program.py", line 302, in train
    loss = loss_class(preds, batch)
  File "C:\Users\Ruben.conda\envs\paddleocr\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "C:\Users\Ruben.conda\envs\paddleocr\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "C:\Work\CENATAV\Libraries\PaddleOCR\ppocr\losses\combined_loss.py", line 58, in forward
    loss = loss_func(input, batch, **kargs)
  File "C:\Users\Ruben.conda\envs\paddleocr\lib\site-packages\paddle\fluid\dygraph\layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "C:\Users\Ruben.conda\envs\paddleocr\lib\site-packages\paddle\fluid\dygraph\layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "C:\Work\CENATAV\Libraries\PaddleOCR\ppocr\losses\distillation_loss.py", line 346, in forward
    loss = super().forward(out, batch)
  File "C:\Work\CENATAV\Libraries\PaddleOCR\ppocr\losses\vqa_token_layoutlm_loss.py", line 39, in forward
    [-1, self.num_classes])[active_loss]
  File "C:\Users\Ruben.conda\envs\paddleocr\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 736, in __getitem__
    return _getitem_impl_(self, item)
  File "C:\Users\Ruben.conda\envs\paddleocr\lib\site-packages\paddle\fluid\variable_index.py", line 431, in _getitem_impl_
    return get_value_for_bool_tensor(var, slice_item)
  File "C:\Users\Ruben.conda\envs\paddleocr\lib\site-packages\paddle\fluid\variable_index.py", line 311, in get_value_for_bool_tensor
    lambda: idx_not_empty(var, item))
  File "C:\Users\Ruben.conda\envs\paddleocr\lib\site-packages\paddle\fluid\layers\control_flow.py", line 2466, in cond
    return false_fn()
  File "C:\Users\Ruben.conda\envs\paddleocr\lib\site-packages\paddle\fluid\variable_index.py", line 311, in <lambda>
    lambda: idx_not_empty(var, item))
  File "C:\Users\Ruben.conda\envs\paddleocr\lib\site-packages\paddle\fluid\variable_index.py", line 300, in idx_not_empty
    bool_2_idx = where(item == True)
  File "C:\Users\Ruben.conda\envs\paddleocr\lib\site-packages\paddle\fluid\layers\nn.py", line 14566, in where
    return _C_ops.where_index(condition)
SystemError: (Fatal) Operator where_index raises an struct paddle::memory::allocation::BadAlloc exception. The exception content is :ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 7.473707GB memory on GPU 0, 5.299316GB memory has been allocated and available memory is only 717.000000MB.

Please check whether there is any other process using GPU 0.

If yes, please stop them, or start PaddlePaddle on another GPU. If no, please decrease the batch size of your model. (at ..\paddle\fluid\memory\allocation\cuda_allocator.cc:87) . (at ..\paddle\fluid\imperative\tracer.cc:307)
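
To put the figures in this error in perspective, here is a rough sketch that parses the message and compares the requested allocation with the size of the logits tensor being masked on the failing line (`[-1, self.num_classes])[active_loss]`). It assumes the logits have shape (batch, max_seq_len, num_classes) in float32 and that the message uses binary units (1 GB = 2^30 bytes). Neither assumption is confirmed in this thread, and the helper names `parse_oom` and `logits_bytes` are hypothetical.

```python
import re

# Error text copied from the report above.
OOM_MESSAGE = (
    "Out of memory error on GPU 0. Cannot allocate 7.473707GB memory on GPU 0, "
    "5.299316GB memory has been allocated and available memory is only 717.000000MB."
)

# Assumption: the message uses binary units.
UNIT_BYTES = {"MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_oom(message):
    """Extract requested, already-allocated and free memory (in bytes)."""
    pattern = (
        r"Cannot allocate (?P<req>[\d.]+)(?P<req_u>[MG]B).*?"
        r"(?P<used>[\d.]+)(?P<used_u>[MG]B) memory has been allocated.*?"
        r"only (?P<free>[\d.]+)(?P<free_u>[MG]B)"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("message does not look like a Paddle OOM report")
    return {
        "requested": float(match["req"]) * UNIT_BYTES[match["req_u"]],
        "allocated": float(match["used"]) * UNIT_BYTES[match["used_u"]],
        "free": float(match["free"]) * UNIT_BYTES[match["free_u"]],
    }


def logits_bytes(batch_size, seq_len, num_classes, bytes_per_value=4):
    """Size of a float32 logits tensor of shape (batch, seq_len, classes)."""
    return batch_size * seq_len * num_classes * bytes_per_value


info = parse_oom(OOM_MESSAGE)
# Values from the logged config: batch_size_per_card=1, max_seq_len=512, num_classes=44.
logits = logits_bytes(batch_size=1, seq_len=512, num_classes=44)

print(f"requested: {info['requested'] / 1024 ** 3:.2f} GB")
print(f"free:      {info['free'] / 1024 ** 2:.0f} MB")
print(f"logits:    {logits / 1024:.0f} KiB")
print(f"requested / logits: {info['requested'] / logits:,.0f}x")
```

Under these assumptions the single request is several orders of magnitude larger than the logits tensor for one sample, which suggests the request does not come from the size of the training batch alone.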

I'm trying to train KIE with knowledge distillation, but it runs out of GPU memory. I have 6 GB of VRAM on my 3060, and I am using CUDA 10.2 with cuDNN 8.4; my Paddle version is 2.3.2. I also reduced the batch size to 1, as you can see in the config. Is there any way I can train this model? I would really appreciate any info. Thanks in advance.
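
The setup described here pairs a GPU that the log reports as Compute Capability 8.6 with a CUDA 10.2 runtime. A minimal sketch of a compatibility check follows, assuming the minimum toolkit versions in the table below (taken from NVIDIA's CUDA release notes as best recalled, so worth verifying). The function names are hypothetical. Whether this mismatch is what triggers the oversized allocation is not established in this thread; the sketch only flags the combination.

```python
# Assumption: minimum CUDA toolkit version with native support for each
# compute capability. Extend or correct the table as needed.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30 series)
}


def parse_version(text):
    """Turn a string such as '10.2' into the tuple (10, 2)."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def runtime_supports_gpu(capability, runtime):
    """Return True if the CUDA runtime is new enough for the capability."""
    cap = parse_version(capability)
    if cap not in MIN_CUDA_FOR_CAPABILITY:
        raise KeyError(f"compute capability {capability} is not in the table")
    return parse_version(runtime) >= MIN_CUDA_FOR_CAPABILITY[cap]


# Values from the gpu_resources.cc warning in the log:
# GPU Compute Capability: 8.6, Runtime API Version: 10.2
print(runtime_supports_gpu("8.6", "10.2"))
```

If the check reports a mismatch, installing a PaddlePaddle build made for a newer CUDA runtime would be one thing to try before moving to a larger GPU.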

Oct 22 '22 15:10 rubensanchezrivero

KIE is a memory-intensive task. You can try training this model on another machine with more GPU memory.

Oct 25 '22 01:10 an1018

ok then

Oct 25 '22 21:10 rubensanchezrivero
