
[Bug]: Error when fine-tuning the UIE model

Open joesong168 opened this issue 1 year ago • 12 comments

Software environment

- paddlepaddle:
- paddlepaddle-gpu: 2.3.2
- paddlenlp: 2.4

Duplicate issues

  • [X] I have searched the existing issues

Error description

Traceback (most recent call last):
  File "/home/aistudio/PaddleNLP/model_zoo/uie/finetune.py", line 287, in <module>
    main()
  File "/home/aistudio/PaddleNLP/model_zoo/uie/finetune.py", line 209, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 582, in train
    ignore_keys_for_eval)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 710, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 1324, in evaluate
    metric_key_prefix=metric_key_prefix,
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 1423, in evaluation_loop
    ignore_keys=ignore_keys)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 1605, in prediction_step
    outputs = model(**inputs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
TypeError: forward() got an unexpected keyword argument 'start_positions'

Steps to reproduce & code

python3 /home/aistudio/PaddleNLP/model_zoo/uie/finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-base \
    --output_dir $finetuned_model \
    --train_path data/train.txt \
    --dev_path data/dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1

joesong168 avatar Nov 03 '22 08:11 joesong168

You can add the option --label_names 'start_positions end_positions'. There were some changes in recent versions; we will fix this right away.
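
For context, here is a rough sketch of why declaring the label names can make the `TypeError` go away. It assumes a simplified trainer step in which keys listed in `label_names` are held back as labels and every other batch key is passed to `forward()` as a keyword argument. `toy_forward` and `prediction_step` are hypothetical stand-ins written for illustration, not PaddleNLP APIs, and the real trainer logic may differ.

```python
def toy_forward(input_ids, token_type_ids=None):
    # Hypothetical stand-in for a model whose forward() takes no label arguments.
    return {"num_tokens": len(input_ids)}


def prediction_step(batch, label_names):
    # Simplified assumption: keys named in label_names are held back as labels,
    # and every remaining key is passed to forward() as a keyword argument.
    labels = {k: batch[k] for k in label_names if k in batch}
    inputs = {k: v for k, v in batch.items() if k not in label_names}
    return toy_forward(**inputs), labels


batch = {
    "input_ids": [1, 2, 3],
    "token_type_ids": [0, 0, 0],
    "start_positions": [0, 1, 0],
    "end_positions": [0, 0, 1],
}

# Without the label names, the label keys leak into forward().
try:
    prediction_step(batch, label_names=[])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# With the label names declared, forward() only sees real inputs.
outputs, labels = prediction_step(batch, ["start_positions", "end_positions"])
print(outputs, sorted(labels))
```

In this toy version, the first call fails with the same kind of "unexpected keyword argument 'start_positions'" message as the traceback, and the second call succeeds.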

wawltor avatar Nov 03 '22 08:11 wawltor

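> You can add the option --label_names 'start_positions end_positions'. There were some changes in recent versions; we will fix this right away.

I ran into this problem too. I added the option and still get the same error.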

pfchai avatar Nov 03 '22 09:11 pfchai

Same problem here. I added the --label_names option and it still fails.

starryzwh avatar Nov 03 '22 09:11 starryzwh

[2022-11-03 17:16:32,036] [ INFO] - ***** Running training *****
[2022-11-03 17:16:32,036] [ INFO] - Num examples = 9048
[2022-11-03 17:16:32,036] [ INFO] - Num Epochs = 100
[2022-11-03 17:16:32,036] [ INFO] - Instantaneous batch size per device = 16
[2022-11-03 17:16:32,036] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 16
[2022-11-03 17:16:32,036] [ INFO] - Gradient Accumulation steps = 1
[2022-11-03 17:16:32,036] [ INFO] - Total optimization steps = 56600.0
[2022-11-03 17:16:32,036] [ INFO] - Total num train samples = 904800
Segmentation fault (core dumped)

joesong168 avatar Nov 03 '22 09:11 joesong168

@pfchai I tried again: the values after --label_names must not be wrapped in '' quotes. Without the quotes it runs normally.
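
A small sketch of why the quotes matter, using only the Python standard library. It assumes the option collects one or more values into a list (here modelled with `nargs="+"`, which is an assumption about how the real argument is declared) and uses `shlex.split` to mimic POSIX shell word splitting. The parser below is a hypothetical stand-in, not the one used by finetune.py.

```python
import argparse
import shlex

parser = argparse.ArgumentParser()
# Assumption: the option collects one or more names into a list.
parser.add_argument("--label_names", nargs="+")


def parse_label_names(command_line):
    # shlex.split mimics POSIX shell word splitting, including quote handling.
    return parser.parse_args(shlex.split(command_line)).label_names


quoted = parse_label_names("--label_names 'start_positions end_positions'")
unquoted = parse_label_names("--label_names start_positions end_positions")

print(quoted)    # ['start_positions end_positions']
print(unquoted)  # ['start_positions', 'end_positions']
```

With quotes, the shell hands over a single argument containing a space, so the list holds one name that matches no batch key. Without quotes, the list holds the two separate names, which matches the `label_names` value printed in the training-arguments log later in this thread.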

starryzwh avatar Nov 03 '22 09:11 starryzwh

Sorry about that. The documentation has been fixed.

wawltor avatar Nov 03 '22 09:11 wawltor

export finetuned_model=./checkpoint/model_best

python3 /home/aistudio/PaddleNLP/model_zoo/uie/finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-nano \
    --output_dir $finetuned_model \
    --train_path data/data175474/train.txt \
    --dev_path data/data175474/dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1 \
    --label_names start_positions end_positions

Segmentation fault (core dumped)

It still fails.

joesong168 avatar Nov 03 '22 09:11 joesong168

[2022-11-03 17:38:19,422] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2022-11-03 17:38:19,422] [ INFO] - The default value for the training argument --report_to will change in v5 (from all installed integrations to none). In v5, you will need to use --report_to all to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2022-11-03 17:38:19,422] [ INFO] - ============================================================
[2022-11-03 17:38:19,422] [ INFO] - Model Configuration Arguments
[2022-11-03 17:38:19,422] [ INFO] - paddle commit id :4596b9a22540fb0ea5d369c3c804544de61d03d0
[2022-11-03 17:38:19,422] [ INFO] - export_model_dir :./checkpoint/model_best
[2022-11-03 17:38:19,422] [ INFO] - model_name_or_path :uie-nano
[2022-11-03 17:38:19,422] [ INFO] - multilingual :False
[2022-11-03 17:38:19,422] [ INFO] -
[2022-11-03 17:38:19,422] [ INFO] - ============================================================
[2022-11-03 17:38:19,422] [ INFO] - Data Configuration Arguments
[2022-11-03 17:38:19,422] [ INFO] - paddle commit id :4596b9a22540fb0ea5d369c3c804544de61d03d0
[2022-11-03 17:38:19,423] [ INFO] - dev_path :data/data175474/dev.txt
[2022-11-03 17:38:19,423] [ INFO] - max_seq_length :512
[2022-11-03 17:38:19,423] [ INFO] - train_path :data/data175474/train.txt
[2022-11-03 17:38:19,423] [ INFO] -
[2022-11-03 17:38:19,423] [ WARNING] - Process rank: -1, device: gpu, world_size: 1, distributed training: False, 16-bits training: False
[2022-11-03 17:38:19,423] [ INFO] - Downloading resource files...
[2022-11-03 17:38:19,425] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-nano'.
W1103 17:38:19.452342 2234 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1103 17:38:19.456578 2234 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022-11-03 17:38:21,033] [ INFO] - ============================================================
[2022-11-03 17:38:21,033] [ INFO] - Training Configuration Arguments
[2022-11-03 17:38:21,033] [ INFO] - paddle commit id :4596b9a22540fb0ea5d369c3c804544de61d03d0
[2022-11-03 17:38:21,033] [ INFO] - _no_sync_in_gradient_accumulation:True
[2022-11-03 17:38:21,033] [ INFO] - activation_preprocess_type :None
[2022-11-03 17:38:21,033] [ INFO] - activation_quantize_type :None
[2022-11-03 17:38:21,033] [ INFO] - adam_beta1 :0.9
[2022-11-03 17:38:21,033] [ INFO] - adam_beta2 :0.999
[2022-11-03 17:38:21,033] [ INFO] - adam_epsilon :1e-08
[2022-11-03 17:38:21,033] [ INFO] - algo_list :None
[2022-11-03 17:38:21,033] [ INFO] - batch_num_list :None
[2022-11-03 17:38:21,033] [ INFO] - batch_size_list :None
[2022-11-03 17:38:21,033] [ INFO] - bias_correction :False
[2022-11-03 17:38:21,033] [ INFO] - current_device :gpu:0
[2022-11-03 17:38:21,034] [ INFO] - dataloader_drop_last :False
[2022-11-03 17:38:21,034] [ INFO] - dataloader_num_workers :0
[2022-11-03 17:38:21,034] [ INFO] - device :gpu
[2022-11-03 17:38:21,034] [ INFO] - disable_tqdm :True
[2022-11-03 17:38:21,034] [ INFO] - do_eval :True
[2022-11-03 17:38:21,034] [ INFO] - do_export :True
[2022-11-03 17:38:21,034] [ INFO] - do_predict :False
[2022-11-03 17:38:21,034] [ INFO] - do_train :True
[2022-11-03 17:38:21,034] [ INFO] - eval_batch_size :16
[2022-11-03 17:38:21,034] [ INFO] - eval_steps :100
[2022-11-03 17:38:21,034] [ INFO] - evaluation_strategy :IntervalStrategy.STEPS
[2022-11-03 17:38:21,034] [ INFO] - fp16 :False
[2022-11-03 17:38:21,034] [ INFO] - fp16_opt_level :O1
[2022-11-03 17:38:21,034] [ INFO] - gradient_accumulation_steps :1
[2022-11-03 17:38:21,034] [ INFO] - greater_is_better :True
[2022-11-03 17:38:21,034] [ INFO] - ignore_data_skip :False
[2022-11-03 17:38:21,034] [ INFO] - input_infer_model_path :None
[2022-11-03 17:38:21,034] [ INFO] - label_names :['start_positions', 'end_positions']
[2022-11-03 17:38:21,034] [ INFO] - learning_rate :1e-05
[2022-11-03 17:38:21,034] [ INFO] - load_best_model_at_end :True
[2022-11-03 17:38:21,034] [ INFO] - local_process_index :0
[2022-11-03 17:38:21,034] [ INFO] - local_rank :-1
[2022-11-03 17:38:21,034] [ INFO] - log_level :-1
[2022-11-03 17:38:21,034] [ INFO] - log_level_replica :-1
[2022-11-03 17:38:21,035] [ INFO] - log_on_each_node :True
[2022-11-03 17:38:21,035] [ INFO] - logging_dir :./checkpoint/model_best/runs/Nov03_17-38-19_jupyter-640378-4961694
[2022-11-03 17:38:21,035] [ INFO] - logging_first_step :False
[2022-11-03 17:38:21,035] [ INFO] - logging_steps :10
[2022-11-03 17:38:21,035] [ INFO] - logging_strategy :IntervalStrategy.STEPS
[2022-11-03 17:38:21,035] [ INFO] - lr_scheduler_type :SchedulerType.LINEAR
[2022-11-03 17:38:21,035] [ INFO] - max_grad_norm :1.0
[2022-11-03 17:38:21,035] [ INFO] - max_steps :-1
[2022-11-03 17:38:21,035] [ INFO] - metric_for_best_model :eval_f1
[2022-11-03 17:38:21,035] [ INFO] - minimum_eval_times :None
[2022-11-03 17:38:21,035] [ INFO] - moving_rate :0.9
[2022-11-03 17:38:21,035] [ INFO] - no_cuda :False
[2022-11-03 17:38:21,035] [ INFO] - num_train_epochs :100.0
[2022-11-03 17:38:21,035] [ INFO] - optim :OptimizerNames.ADAMW
[2022-11-03 17:38:21,035] [ INFO] - output_dir :./checkpoint/model_best
[2022-11-03 17:38:21,035] [ INFO] - overwrite_output_dir :True
[2022-11-03 17:38:21,035] [ INFO] - past_index :-1
[2022-11-03 17:38:21,035] [ INFO] - per_device_eval_batch_size :16
[2022-11-03 17:38:21,035] [ INFO] - per_device_train_batch_size :16
[2022-11-03 17:38:21,035] [ INFO] - prediction_loss_only :False
[2022-11-03 17:38:21,035] [ INFO] - process_index :0
[2022-11-03 17:38:21,035] [ INFO] - recompute :False
[2022-11-03 17:38:21,035] [ INFO] - remove_unused_columns :True
[2022-11-03 17:38:21,035] [ INFO] - report_to :['visualdl']
[2022-11-03 17:38:21,035] [ INFO] - resume_from_checkpoint :None
[2022-11-03 17:38:21,035] [ INFO] - round_type :round
[2022-11-03 17:38:21,035] [ INFO] - run_name :./checkpoint/model_best
[2022-11-03 17:38:21,036] [ INFO] - save_on_each_node :False
[2022-11-03 17:38:21,036] [ INFO] - save_steps :100
[2022-11-03 17:38:21,036] [ INFO] - save_strategy :IntervalStrategy.STEPS
[2022-11-03 17:38:21,036] [ INFO] - save_total_limit :1
[2022-11-03 17:38:21,036] [ INFO] - scale_loss :32768
[2022-11-03 17:38:21,036] [ INFO] - seed :42
[2022-11-03 17:38:21,036] [ INFO] - should_log :True
[2022-11-03 17:38:21,036] [ INFO] - should_save :True
[2022-11-03 17:38:21,036] [ INFO] - strategy :dynabert+ptq
[2022-11-03 17:38:21,036] [ INFO] - train_batch_size :16
[2022-11-03 17:38:21,036] [ INFO] - warmup_ratio :0.1
[2022-11-03 17:38:21,036] [ INFO] - warmup_steps :0
[2022-11-03 17:38:21,036] [ INFO] - weight_decay :0.0
[2022-11-03 17:38:21,036] [ INFO] - weight_preprocess_type :None
[2022-11-03 17:38:21,036] [ INFO] - weight_quantize_type :channel_wise_abs_max
[2022-11-03 17:38:21,036] [ INFO] - width_mult_list :None
[2022-11-03 17:38:21,036] [ INFO] - world_size :1
[2022-11-03 17:38:21,036] [ INFO] -
[2022-11-03 17:38:21,037] [ INFO] - ***** Running training *****
[2022-11-03 17:38:21,037] [ INFO] - Num examples = 9048
[2022-11-03 17:38:21,037] [ INFO] - Num Epochs = 100
[2022-11-03 17:38:21,037] [ INFO] - Instantaneous batch size per device = 16
[2022-11-03 17:38:21,037] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 16
[2022-11-03 17:38:21,037] [ INFO] - Gradient Accumulation steps = 1
[2022-11-03 17:38:21,037] [ INFO] - Total optimization steps = 56600.0
[2022-11-03 17:38:21,037] [ INFO] - Total num train samples = 904800
Segmentation fault (core dumped)
aistudio@jupyter-640378-4961694:~$

joesong168 avatar Nov 03 '22 09:11 joesong168

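  1. Check whether Paddle was installed successfully:
    import paddle
    paddle.utils.run_check()

    If the installation did not succeed, you can install it with conda: https://www.paddlepaddle.org.cn/
  2. If the installation succeeded:
    If the check reports that Paddle is installed correctly, see whether GPU memory is overflowing. You can try adjusting the batch size via per_device_train_batch_size.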
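
If you do change the batch size, the step counts printed at the start of training change with it. The sketch below reproduces the figures from the log above (9048 examples, 100 epochs, batch size 16) and shows what to expect for smaller batches. It assumes a single device, gradient accumulation of 1, and that the last partial batch of each epoch is kept, as the logged `dataloader_drop_last :False` suggests. `total_optimization_steps` is a hypothetical helper, not a PaddleNLP function.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    # Single device, no gradient accumulation, last partial batch kept.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs


# Values taken from the training log in this thread.
num_examples, num_epochs = 9048, 100

for batch_size in (16, 8, 4):
    steps = total_optimization_steps(num_examples, batch_size, num_epochs)
    print(f"batch_size={batch_size}: {steps} optimization steps")
```

For batch size 16 this gives 56600, which matches the "Total optimization steps" line in the log. Halving the batch size roughly doubles the number of steps, so settings expressed in steps, such as --save_steps and --eval_steps, will fire at different points in each epoch.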

wawltor avatar Nov 03 '22 10:11 wawltor

After fine-tuning finished, it suddenly started compressing the model and then raised an error. How can I fix this?

Alone749-i avatar Nov 08 '22 04:11 Alone749-i

On Windows, the outer quotes around 'start_positions' 'end_positions' have to be removed; after that it runs.

QiangzhenZhu avatar Nov 08 '22 07:11 QiangzhenZhu

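> After fine-tuning finished, it suddenly started compressing the model and then raised an error. How can I fix this?

For this problem, see https://github.com/PaddlePaddle/PaddleNLP/issues/3700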

pfchai avatar Nov 08 '22 13:11 pfchai

Check your scipy version. It needs to satisfy scipy<=1.3.1 & scipy>=1.7.3

LiuChiachi avatar Nov 14 '22 10:11 LiuChiachi

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Jan 14 '23 00:01 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jan 28 '23 00:01 github-actions[bot]