PaddleNLP [Bug]: Error when fine-tuning the UIE model

Open joesong168 opened this issue 1 year ago • 12 comments

Software environment

- paddlepaddle:
- paddlepaddle-gpu: 2.3.2
- paddlenlp: 2.4

Duplicate issues

[X] I have searched the existing issues

Error description

Traceback (most recent call last):
  File "/home/aistudio/PaddleNLP/model_zoo/uie/finetune.py", line 287, in <module>
    main()
  File "/home/aistudio/PaddleNLP/model_zoo/uie/finetune.py", line 209, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 582, in train
    ignore_keys_for_eval)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 710, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 1324, in evaluate
    metric_key_prefix=metric_key_prefix,
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 1423, in evaluation_loop
    ignore_keys=ignore_keys)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 1605, in prediction_step
    outputs = model(**inputs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
TypeError: forward() got an unexpected keyword argument 'start_positions'

Steps to reproduce & code

python3 /home/aistudio/PaddleNLP/model_zoo/uie/finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-base \
    --output_dir $finetuned_model \
    --train_path data/train.txt \
    --dev_path data/dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1

Nov 03 '22 08:11 joesong168

You can add --label_names 'start_positions end_positions'. This argument changed somewhat in recent versions; we will fix it right away.

Nov 03 '22 08:11 wawltor
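The traceback above ends in `model(**inputs)` receiving a `start_positions` keyword that `forward()` does not accept, which is why naming the label keys helps. The following is a minimal sketch of that mechanism, not PaddleNLP's actual Trainer code: `forward` and `prediction_step` here are hypothetical stand-ins, and the batch contents are made up for illustration.

```python
# Hypothetical stand-in for a model whose forward() accepts only input features.
def forward(input_ids, token_type_ids=None):
    return {"n_tokens": len(input_ids)}


def prediction_step(inputs, label_names):
    """Simplified sketch: remove label keys from the batch before calling forward()."""
    inputs = dict(inputs)  # copy so the caller's batch is left untouched
    labels = {name: inputs.pop(name) for name in label_names if name in inputs}
    outputs = forward(**inputs)
    return outputs, labels


# Illustrative batch: two feature keys plus the two label keys from this issue.
batch = {
    "input_ids": [1, 2, 3],
    "token_type_ids": [0, 0, 0],
    "start_positions": [0, 1, 0],
    "end_positions": [0, 0, 1],
}

# With no label names configured, the label keys reach forward() and raise TypeError.
try:
    prediction_step(batch, label_names=[])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# With the label names configured, the labels are split off and forward() succeeds.
outputs, labels = prediction_step(
    batch, label_names=["start_positions", "end_positions"]
)
print(outputs, sorted(labels))
```

In this sketch the only difference between the failing and the working call is whether the label key names are known to the step that invokes the model.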

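> You can add --label_names 'start_positions end_positions'. This argument changed somewhat in recent versions; we will fix it right away.

I ran into this problem too. After adding the argument, I still get the same error.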

Nov 03 '22 09:11 pfchai

Same problem here. I added the --label_names argument and it still fails.

Nov 03 '22 09:11 starryzwh

[2022-11-03 17:16:32,036] [ INFO] - ***** Running training *****
[2022-11-03 17:16:32,036] [ INFO] - Num examples = 9048
[2022-11-03 17:16:32,036] [ INFO] - Num Epochs = 100
[2022-11-03 17:16:32,036] [ INFO] - Instantaneous batch size per device = 16
[2022-11-03 17:16:32,036] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 16
[2022-11-03 17:16:32,036] [ INFO] - Gradient Accumulation steps = 1
[2022-11-03 17:16:32,036] [ INFO] - Total optimization steps = 56600.0
[2022-11-03 17:16:32,036] [ INFO] - Total num train samples = 904800
Segmentation fault (core dumped)

Nov 03 '22 09:11 joesong168

@pfchai I tried again: the values after --label_names should not be wrapped in '' quotes. Without the quotes it runs fine.

Nov 03 '22 09:11 starryzwh
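A likely reason the quotes matter is that the shell passes a quoted string as a single argument. The sketch below reproduces that with the standard library only, under the assumption that `--label_names` behaves like a list-valued option (modeled here with argparse's `nargs="+"`); it does not use the parser from finetune.py.

```python
import argparse
import shlex

parser = argparse.ArgumentParser()
# Assumption: --label_names is list-valued, modeled here with nargs="+".
parser.add_argument("--label_names", nargs="+", default=None)

# shlex.split mimics how a POSIX shell splits the command line into arguments.
quoted_argv = shlex.split("--label_names 'start_positions end_positions'")
unquoted_argv = shlex.split("--label_names start_positions end_positions")

quoted_names = parser.parse_args(quoted_argv).label_names
unquoted_names = parser.parse_args(unquoted_argv).label_names

print(quoted_names)    # ['start_positions end_positions']
print(unquoted_names)  # ['start_positions', 'end_positions']
```

With quotes, the parser sees one label name that contains a space, which matches no key in the batch; without quotes it sees the two separate names.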

Sorry about that. The documentation has been fixed.

Nov 03 '22 09:11 wawltor

export finetuned_model=./checkpoint/model_best

python3 /home/aistudio/PaddleNLP/model_zoo/uie/finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-nano \
    --output_dir $finetuned_model \
    --train_path data/data175474/train.txt \
    --dev_path data/data175474/dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1 \
    --label_names start_positions end_positions

Segmentation fault (core dumped)

It still fails.

Nov 03 '22 09:11 joesong168

[2022-11-03 17:38:19,422] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2022-11-03 17:38:19,422] [ INFO] - The default value for the training argument --report_to will change in v5 (from all installed integrations to none). In v5, you will need to use --report_to all to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2022-11-03 17:38:19,422] [ INFO] - ============================================================
[2022-11-03 17:38:19,422] [ INFO] - Model Configuration Arguments
[2022-11-03 17:38:19,422] [ INFO] - paddle commit id :4596b9a22540fb0ea5d369c3c804544de61d03d0
[2022-11-03 17:38:19,422] [ INFO] - export_model_dir :./checkpoint/model_best
[2022-11-03 17:38:19,422] [ INFO] - model_name_or_path :uie-nano
[2022-11-03 17:38:19,422] [ INFO] - multilingual :False
[2022-11-03 17:38:19,422] [ INFO] -
[2022-11-03 17:38:19,422] [ INFO] - ============================================================
[2022-11-03 17:38:19,422] [ INFO] - Data Configuration Arguments
[2022-11-03 17:38:19,422] [ INFO] - paddle commit id :4596b9a22540fb0ea5d369c3c804544de61d03d0
[2022-11-03 17:38:19,423] [ INFO] - dev_path :data/data175474/dev.txt
[2022-11-03 17:38:19,423] [ INFO] - max_seq_length :512
[2022-11-03 17:38:19,423] [ INFO] - train_path :data/data175474/train.txt
[2022-11-03 17:38:19,423] [ INFO] -
[2022-11-03 17:38:19,423] [ WARNING] - Process rank: -1, device: gpu, world_size: 1, distributed training: False, 16-bits training: False
[2022-11-03 17:38:19,423] [ INFO] - Downloading resource files...
[2022-11-03 17:38:19,425] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-nano'.
W1103 17:38:19.452342 2234 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1103 17:38:19.456578 2234 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022-11-03 17:38:21,033] [ INFO] - ============================================================
[2022-11-03 17:38:21,033] [ INFO] - Training Configuration Arguments
[2022-11-03 17:38:21,033] [ INFO] - paddle commit id :4596b9a22540fb0ea5d369c3c804544de61d03d0
[2022-11-03 17:38:21,033] [ INFO] - _no_sync_in_gradient_accumulation:True
[2022-11-03 17:38:21,033] [ INFO] - activation_preprocess_type :None
[2022-11-03 17:38:21,033] [ INFO] - activation_quantize_type :None
[2022-11-03 17:38:21,033] [ INFO] - adam_beta1 :0.9
[2022-11-03 17:38:21,033] [ INFO] - adam_beta2 :0.999
[2022-11-03 17:38:21,033] [ INFO] - adam_epsilon :1e-08
[2022-11-03 17:38:21,033] [ INFO] - algo_list :None
[2022-11-03 17:38:21,033] [ INFO] - batch_num_list :None
[2022-11-03 17:38:21,033] [ INFO] - batch_size_list :None
[2022-11-03 17:38:21,033] [ INFO] - bias_correction :False
[2022-11-03 17:38:21,033] [ INFO] - current_device :gpu:0
[2022-11-03 17:38:21,034] [ INFO] - dataloader_drop_last :False
[2022-11-03 17:38:21,034] [ INFO] - dataloader_num_workers :0
[2022-11-03 17:38:21,034] [ INFO] - device :gpu
[2022-11-03 17:38:21,034] [ INFO] - disable_tqdm :True
[2022-11-03 17:38:21,034] [ INFO] - do_eval :True
[2022-11-03 17:38:21,034] [ INFO] - do_export :True
[2022-11-03 17:38:21,034] [ INFO] - do_predict :False
[2022-11-03 17:38:21,034] [ INFO] - do_train :True
[2022-11-03 17:38:21,034] [ INFO] - eval_batch_size :16
[2022-11-03 17:38:21,034] [ INFO] - eval_steps :100
[2022-11-03 17:38:21,034] [ INFO] - evaluation_strategy :IntervalStrategy.STEPS
[2022-11-03 17:38:21,034] [ INFO] - fp16 :False
[2022-11-03 17:38:21,034] [ INFO] - fp16_opt_level :O1
[2022-11-03 17:38:21,034] [ INFO] - gradient_accumulation_steps :1
[2022-11-03 17:38:21,034] [ INFO] - greater_is_better :True
[2022-11-03 17:38:21,034] [ INFO] - ignore_data_skip :False
[2022-11-03 17:38:21,034] [ INFO] - input_infer_model_path :None
[2022-11-03 17:38:21,034] [ INFO] - label_names :['start_positions', 'end_positions']
[2022-11-03 17:38:21,034] [ INFO] - learning_rate :1e-05
[2022-11-03 17:38:21,034] [ INFO] - load_best_model_at_end :True
[2022-11-03 17:38:21,034] [ INFO] - local_process_index :0
[2022-11-03 17:38:21,034] [ INFO] - local_rank :-1
[2022-11-03 17:38:21,034] [ INFO] - log_level :-1
[2022-11-03 17:38:21,034] [ INFO] - log_level_replica :-1
[2022-11-03 17:38:21,035] [ INFO] - log_on_each_node :True
[2022-11-03 17:38:21,035] [ INFO] - logging_dir :./checkpoint/model_best/runs/Nov03_17-38-19_jupyter-640378-4961694
[2022-11-03 17:38:21,035] [ INFO] - logging_first_step :False
[2022-11-03 17:38:21,035] [ INFO] - logging_steps :10
[2022-11-03 17:38:21,035] [ INFO] - logging_strategy :IntervalStrategy.STEPS
[2022-11-03 17:38:21,035] [ INFO] - lr_scheduler_type :SchedulerType.LINEAR
[2022-11-03 17:38:21,035] [ INFO] - max_grad_norm :1.0
[2022-11-03 17:38:21,035] [ INFO] - max_steps :-1
[2022-11-03 17:38:21,035] [ INFO] - metric_for_best_model :eval_f1
[2022-11-03 17:38:21,035] [ INFO] - minimum_eval_times :None
[2022-11-03 17:38:21,035] [ INFO] - moving_rate :0.9
[2022-11-03 17:38:21,035] [ INFO] - no_cuda :False
[2022-11-03 17:38:21,035] [ INFO] - num_train_epochs :100.0
[2022-11-03 17:38:21,035] [ INFO] - optim :OptimizerNames.ADAMW
[2022-11-03 17:38:21,035] [ INFO] - output_dir :./checkpoint/model_best
[2022-11-03 17:38:21,035] [ INFO] - overwrite_output_dir :True
[2022-11-03 17:38:21,035] [ INFO] - past_index :-1
[2022-11-03 17:38:21,035] [ INFO] - per_device_eval_batch_size :16
[2022-11-03 17:38:21,035] [ INFO] - per_device_train_batch_size :16
[2022-11-03 17:38:21,035] [ INFO] - prediction_loss_only :False
[2022-11-03 17:38:21,035] [ INFO] - process_index :0
[2022-11-03 17:38:21,035] [ INFO] - recompute :False
[2022-11-03 17:38:21,035] [ INFO] - remove_unused_columns :True
[2022-11-03 17:38:21,035] [ INFO] - report_to :['visualdl']
[2022-11-03 17:38:21,035] [ INFO] - resume_from_checkpoint :None
[2022-11-03 17:38:21,035] [ INFO] - round_type :round
[2022-11-03 17:38:21,035] [ INFO] - run_name :./checkpoint/model_best
[2022-11-03 17:38:21,036] [ INFO] - save_on_each_node :False
[2022-11-03 17:38:21,036] [ INFO] - save_steps :100
[2022-11-03 17:38:21,036] [ INFO] - save_strategy :IntervalStrategy.STEPS
[2022-11-03 17:38:21,036] [ INFO] - save_total_limit :1
[2022-11-03 17:38:21,036] [ INFO] - scale_loss :32768
[2022-11-03 17:38:21,036] [ INFO] - seed :42
[2022-11-03 17:38:21,036] [ INFO] - should_log :True
[2022-11-03 17:38:21,036] [ INFO] - should_save :True
[2022-11-03 17:38:21,036] [ INFO] - strategy :dynabert+ptq
[2022-11-03 17:38:21,036] [ INFO] - train_batch_size :16
[2022-11-03 17:38:21,036] [ INFO] - warmup_ratio :0.1
[2022-11-03 17:38:21,036] [ INFO] - warmup_steps :0
[2022-11-03 17:38:21,036] [ INFO] - weight_decay :0.0
[2022-11-03 17:38:21,036] [ INFO] - weight_preprocess_type :None
[2022-11-03 17:38:21,036] [ INFO] - weight_quantize_type :channel_wise_abs_max
[2022-11-03 17:38:21,036] [ INFO] - width_mult_list :None
[2022-11-03 17:38:21,036] [ INFO] - world_size :1
[2022-11-03 17:38:21,036] [ INFO] -
[2022-11-03 17:38:21,037] [ INFO] - ***** Running training *****
[2022-11-03 17:38:21,037] [ INFO] - Num examples = 9048
[2022-11-03 17:38:21,037] [ INFO] - Num Epochs = 100
[2022-11-03 17:38:21,037] [ INFO] - Instantaneous batch size per device = 16
[2022-11-03 17:38:21,037] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 16
[2022-11-03 17:38:21,037] [ INFO] - Gradient Accumulation steps = 1
[2022-11-03 17:38:21,037] [ INFO] - Total optimization steps = 56600.0
[2022-11-03 17:38:21,037] [ INFO] - Total num train samples = 904800
Segmentation fault (core dumped)
aistudio@jupyter-640378-4961694:~$

Nov 03 '22 09:11 joesong168