PaddleNLP
[Bug]: Error when fine-tuning the UIE model
Software environment
- paddlepaddle:
- paddlepaddle-gpu: 2.3.2
- paddlenlp: 2.4
Duplicate issues
- [X] I have searched the existing issues
Error description
Traceback (most recent call last):
File "/home/aistudio/PaddleNLP/model_zoo/uie/finetune.py", line 287, in <module>
main()
File "/home/aistudio/PaddleNLP/model_zoo/uie/finetune.py", line 209, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 582, in train
ignore_keys_for_eval)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 710, in _maybe_log_save_evaluate
metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 1324, in evaluate
metric_key_prefix=metric_key_prefix,
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 1423, in evaluation_loop
ignore_keys=ignore_keys)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/trainer/trainer_base.py", line 1605, in prediction_step
outputs = model(**inputs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
TypeError: forward() got an unexpected keyword argument 'start_positions'
Steps to reproduce & code
python3 /home/aistudio/PaddleNLP/model_zoo/uie/finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-base \
    --output_dir $finetuned_model \
    --train_path data/train.txt \
    --dev_path data/dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1
You can add the option --label_names 'start_positions end_positions'. Some things changed in the recent release; we will fix this shortly.
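The traceback ends in model(**inputs) receiving start_positions, which suggests the label tensors were passed into forward() along with the real inputs. Below is a minimal sketch of why --label_names could matter, assuming the trainer uses that list to hold label keys back from the model call. ToyUIE and prediction_step are hypothetical stand-ins written for illustration, not PaddleNLP code.

```python
class ToyUIE:
    """Hypothetical stand-in for a model whose forward() takes no label arguments."""

    def forward(self, input_ids, token_type_ids=None):
        # Return a dummy per-token score so the example stays dependency-free.
        return [0.0 for _ in input_ids]


def prediction_step(model, inputs, label_names):
    """Split a batch into labels and model inputs, then call the model.

    Assumption: keys listed in label_names are held back from forward().
    """
    labels = {k: v for k, v in inputs.items() if k in label_names}
    model_inputs = {k: v for k, v in inputs.items() if k not in label_names}
    return model.forward(**model_inputs), labels


batch = {
    "input_ids": [101, 2769, 102],
    "start_positions": [0, 1, 0],
    "end_positions": [0, 1, 0],
}

# With the label names declared, forward() only sees input_ids.
outputs, labels = prediction_step(
    ToyUIE(), batch, ["start_positions", "end_positions"]
)

# With some other label list (here the hypothetical default ["labels"]),
# the label keys leak into forward() and Python raises a TypeError.
try:
    prediction_step(ToyUIE(), batch, ["labels"])
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

In this toy setup the second call fails with the same kind of "unexpected keyword argument 'start_positions'" message as the report, while the first call succeeds.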
I ran into this problem too. I added the option, but I still get the same error.
Same problem here. I added the --label_names argument and it still fails.
[2022-11-03 17:16:32,036] [ INFO] - ***** Running training *****
[2022-11-03 17:16:32,036] [ INFO] - Num examples = 9048
[2022-11-03 17:16:32,036] [ INFO] - Num Epochs = 100
[2022-11-03 17:16:32,036] [ INFO] - Instantaneous batch size per device = 16
[2022-11-03 17:16:32,036] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 16
[2022-11-03 17:16:32,036] [ INFO] - Gradient Accumulation steps = 1
[2022-11-03 17:16:32,036] [ INFO] - Total optimization steps = 56600.0
[2022-11-03 17:16:32,036] [ INFO] - Total num train samples = 904800
Segmentation fault (core dumped)
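@pfchai I tried again: the values after --label_names must not be wrapped in '' quotes. Without the quotes it runs normally.
Sorry about that. The documentation has been fixed.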
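The effect of the quotes can be reproduced outside the training script. The sketch below assumes that --label_names is parsed as a list of one or more values (argparse nargs="+"); that is an assumption about the script's parser, and shlex.split is used only to imitate POSIX shell word splitting.

```python
import argparse
import shlex

parser = argparse.ArgumentParser()
# Assumption: the option accepts one or more space-separated values.
parser.add_argument("--label_names", nargs="+")

# shlex.split mimics how a POSIX shell turns a command line into arguments.
quoted = shlex.split("--label_names 'start_positions end_positions'")
unquoted = shlex.split("--label_names start_positions end_positions")

quoted_value = parser.parse_args(quoted).label_names
unquoted_value = parser.parse_args(unquoted).label_names

print(quoted_value)    # a single element that contains a space
print(unquoted_value)  # two separate label names
```

With quotes, the parser receives one label called "start_positions end_positions", which matches no key in the batch. Without quotes it receives the two names, which is consistent with the label_names :['start_positions', 'end_positions'] line in the training log.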
export finetuned_model=./checkpoint/model_best
python3 /home/aistudio/PaddleNLP/model_zoo/uie/finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-nano \
    --output_dir $finetuned_model \
    --train_path data/data175474/train.txt \
    --dev_path data/data175474/dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 100 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1 \
    --label_names start_positions end_positions
Segmentation fault (core dumped)
It still fails.
[2022-11-03 17:38:19,422] [ WARNING] - evaluation_strategy reset to IntervalStrategy.STEPS for do_eval is True. you can also set evaluation_strategy='epoch'.
[2022-11-03 17:38:19,422] [ INFO] - The default value for the training argument --report_to
will change in v5 (from all installed integrations to none). In v5, you will need to use --report_to all
to get the same behavior as now. You should start updating your code and make this info disappear :-).
[2022-11-03 17:38:19,422] [ INFO] - ============================================================
[2022-11-03 17:38:19,422] [ INFO] - Model Configuration Arguments
[2022-11-03 17:38:19,422] [ INFO] - paddle commit id :4596b9a22540fb0ea5d369c3c804544de61d03d0
[2022-11-03 17:38:19,422] [ INFO] - export_model_dir :./checkpoint/model_best
[2022-11-03 17:38:19,422] [ INFO] - model_name_or_path :uie-nano
[2022-11-03 17:38:19,422] [ INFO] - multilingual :False
[2022-11-03 17:38:19,422] [ INFO] -
[2022-11-03 17:38:19,422] [ INFO] - ============================================================
[2022-11-03 17:38:19,422] [ INFO] - Data Configuration Arguments
[2022-11-03 17:38:19,422] [ INFO] - paddle commit id :4596b9a22540fb0ea5d369c3c804544de61d03d0
[2022-11-03 17:38:19,423] [ INFO] - dev_path :data/data175474/dev.txt
[2022-11-03 17:38:19,423] [ INFO] - max_seq_length :512
[2022-11-03 17:38:19,423] [ INFO] - train_path :data/data175474/train.txt
[2022-11-03 17:38:19,423] [ INFO] -
[2022-11-03 17:38:19,423] [ WARNING] - Process rank: -1, device: gpu, world_size: 1, distributed training: False, 16-bits training: False
[2022-11-03 17:38:19,423] [ INFO] - Downloading resource files...
[2022-11-03 17:38:19,425] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'uie-nano'.
W1103 17:38:19.452342 2234 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1103 17:38:19.456578 2234 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
[2022-11-03 17:38:21,033] [ INFO] - ============================================================
[2022-11-03 17:38:21,033] [ INFO] - Training Configuration Arguments
[2022-11-03 17:38:21,033] [ INFO] - paddle commit id :4596b9a22540fb0ea5d369c3c804544de61d03d0
[2022-11-03 17:38:21,033] [ INFO] - _no_sync_in_gradient_accumulation:True
[2022-11-03 17:38:21,033] [ INFO] - activation_preprocess_type :None
[2022-11-03 17:38:21,033] [ INFO] - activation_quantize_type :None
[2022-11-03 17:38:21,033] [ INFO] - adam_beta1 :0.9
[2022-11-03 17:38:21,033] [ INFO] - adam_beta2 :0.999
[2022-11-03 17:38:21,033] [ INFO] - adam_epsilon :1e-08
[2022-11-03 17:38:21,033] [ INFO] - algo_list :None
[2022-11-03 17:38:21,033] [ INFO] - batch_num_list :None
[2022-11-03 17:38:21,033] [ INFO] - batch_size_list :None
[2022-11-03 17:38:21,033] [ INFO] - bias_correction :False
[2022-11-03 17:38:21,033] [ INFO] - current_device :gpu:0
[2022-11-03 17:38:21,034] [ INFO] - dataloader_drop_last :False
[2022-11-03 17:38:21,034] [ INFO] - dataloader_num_workers :0
[2022-11-03 17:38:21,034] [ INFO] - device :gpu
[2022-11-03 17:38:21,034] [ INFO] - disable_tqdm :True
[2022-11-03 17:38:21,034] [ INFO] - do_eval :True
[2022-11-03 17:38:21,034] [ INFO] - do_export :True
[2022-11-03 17:38:21,034] [ INFO] - do_predict :False
[2022-11-03 17:38:21,034] [ INFO] - do_train :True
[2022-11-03 17:38:21,034] [ INFO] - eval_batch_size :16
[2022-11-03 17:38:21,034] [ INFO] - eval_steps :100
[2022-11-03 17:38:21,034] [ INFO] - evaluation_strategy :IntervalStrategy.STEPS
[2022-11-03 17:38:21,034] [ INFO] - fp16 :False
[2022-11-03 17:38:21,034] [ INFO] - fp16_opt_level :O1
[2022-11-03 17:38:21,034] [ INFO] - gradient_accumulation_steps :1
[2022-11-03 17:38:21,034] [ INFO] - greater_is_better :True
[2022-11-03 17:38:21,034] [ INFO] - ignore_data_skip :False
[2022-11-03 17:38:21,034] [ INFO] - input_infer_model_path :None
[2022-11-03 17:38:21,034] [ INFO] - label_names :['start_positions', 'end_positions']
[2022-11-03 17:38:21,034] [ INFO] - learning_rate :1e-05
[2022-11-03 17:38:21,034] [ INFO] - load_best_model_at_end :True
[2022-11-03 17:38:21,034] [ INFO] - local_process_index :0
[2022-11-03 17:38:21,034] [ INFO] - local_rank :-1
[2022-11-03 17:38:21,034] [ INFO] - log_level :-1
[2022-11-03 17:38:21,034] [ INFO] - log_level_replica :-1
[2022-11-03 17:38:21,035] [ INFO] - log_on_each_node :True
[2022-11-03 17:38:21,035] [ INFO] - logging_dir :./checkpoint/model_best/runs/Nov03_17-38-19_jupyter-640378-4961694
[2022-11-03 17:38:21,035] [ INFO] - logging_first_step :False
[2022-11-03 17:38:21,035] [ INFO] - logging_steps :10
[2022-11-03 17:38:21,035] [ INFO] - logging_strategy :IntervalStrategy.STEPS
[2022-11-03 17:38:21,035] [ INFO] - lr_scheduler_type :SchedulerType.LINEAR
[2022-11-03 17:38:21,035] [ INFO] - max_grad_norm :1.0
[2022-11-03 17:38:21,035] [ INFO] - max_steps :-1
[2022-11-03 17:38:21,035] [ INFO] - metric_for_best_model :eval_f1
[2022-11-03 17:38:21,035] [ INFO] - minimum_eval_times :None
[2022-11-03 17:38:21,035] [ INFO] - moving_rate :0.9
[2022-11-03 17:38:21,035] [ INFO] - no_cuda :False
[2022-11-03 17:38:21,035] [ INFO] - num_train_epochs :100.0
[2022-11-03 17:38:21,035] [ INFO] - optim :OptimizerNames.ADAMW
[2022-11-03 17:38:21,035] [ INFO] - output_dir :./checkpoint/model_best
[2022-11-03 17:38:21,035] [ INFO] - overwrite_output_dir :True
[2022-11-03 17:38:21,035] [ INFO] - past_index :-1
[2022-11-03 17:38:21,035] [ INFO] - per_device_eval_batch_size :16
[2022-11-03 17:38:21,035] [ INFO] - per_device_train_batch_size :16
[2022-11-03 17:38:21,035] [ INFO] - prediction_loss_only :False
[2022-11-03 17:38:21,035] [ INFO] - process_index :0
[2022-11-03 17:38:21,035] [ INFO] - recompute :False
[2022-11-03 17:38:21,035] [ INFO] - remove_unused_columns :True
[2022-11-03 17:38:21,035] [ INFO] - report_to :['visualdl']
[2022-11-03 17:38:21,035] [ INFO] - resume_from_checkpoint :None
[2022-11-03 17:38:21,035] [ INFO] - round_type :round
[2022-11-03 17:38:21,035] [ INFO] - run_name :./checkpoint/model_best
[2022-11-03 17:38:21,036] [ INFO] - save_on_each_node :False
[2022-11-03 17:38:21,036] [ INFO] - save_steps :100
[2022-11-03 17:38:21,036] [ INFO] - save_strategy :IntervalStrategy.STEPS
[2022-11-03 17:38:21,036] [ INFO] - save_total_limit :1
[2022-11-03 17:38:21,036] [ INFO] - scale_loss :32768
[2022-11-03 17:38:21,036] [ INFO] - seed :42
[2022-11-03 17:38:21,036] [ INFO] - should_log :True
[2022-11-03 17:38:21,036] [ INFO] - should_save :True
[2022-11-03 17:38:21,036] [ INFO] - strategy :dynabert+ptq
[2022-11-03 17:38:21,036] [ INFO] - train_batch_size :16
[2022-11-03 17:38:21,036] [ INFO] - warmup_ratio :0.1
[2022-11-03 17:38:21,036] [ INFO] - warmup_steps :0
[2022-11-03 17:38:21,036] [ INFO] - weight_decay :0.0
[2022-11-03 17:38:21,036] [ INFO] - weight_preprocess_type :None
[2022-11-03 17:38:21,036] [ INFO] - weight_quantize_type :channel_wise_abs_max
[2022-11-03 17:38:21,036] [ INFO] - width_mult_list :None
[2022-11-03 17:38:21,036] [ INFO] - world_size :1
[2022-11-03 17:38:21,036] [ INFO] -
[2022-11-03 17:38:21,037] [ INFO] - ***** Running training *****
[2022-11-03 17:38:21,037] [ INFO] - Num examples = 9048
[2022-11-03 17:38:21,037] [ INFO] - Num Epochs = 100
[2022-11-03 17:38:21,037] [ INFO] - Instantaneous batch size per device = 16
[2022-11-03 17:38:21,037] [ INFO] - Total train batch size (w. parallel, distributed & accumulation) = 16
[2022-11-03 17:38:21,037] [ INFO] - Gradient Accumulation steps = 1
[2022-11-03 17:38:21,037] [ INFO] - Total optimization steps = 56600.0
[2022-11-03 17:38:21,037] [ INFO] - Total num train samples = 904800
Segmentation fault (core dumped)
aistudio@jupyter-640378-4961694:~$
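As a sanity check on the log above, the reported step and sample counts can be reproduced from the example count, batch size and epoch count. This is a small sketch that assumes the last partial batch of each epoch is kept, which matches dataloader_drop_last :False in the log.

```python
import math

num_examples = 9048   # "Num examples" in the log
batch_size = 16       # "Total train batch size" in the log
num_epochs = 100      # "Num Epochs" in the log

# 9048 / 16 = 565.5, so the final, smaller batch adds one more step per epoch.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
total_samples = num_examples * num_epochs

print(steps_per_epoch, total_steps, total_samples)  # 566 56600 904800
```

The numbers agree with "Total optimization steps = 56600.0" and "Total num train samples = 904800", so the configuration was parsed as intended and the crash happens after setup.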
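- Check whether paddle is installed correctly
Run import paddle and then paddle.utils.run_check(). If the installation is broken, you can reinstall it with conda: https://www.paddlepaddle.org.cn/
- If the installation is fine
If the check reports that paddle is installed successfully, check whether GPU memory is overflowing. Try adjusting the batch size with per_device_train_batch_size.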
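The installation check can be wrapped in a small script. The sketch below only relies on paddle.utils.run_check(), which the reply names; the helper paddle_is_healthy is a hypothetical name used for illustration.

```python
def paddle_is_healthy():
    """Return True if paddle imports and its built-in self-check completes."""
    try:
        import paddle
    except ImportError:
        print("paddle is not installed in this environment")
        return False
    try:
        # Runs a small computation on the available device(s) and prints the result.
        paddle.utils.run_check()
    except Exception as exc:
        print(f"paddle self-check failed: {exc}")
        return False
    return True


healthy = paddle_is_healthy()
print("paddle OK" if healthy else "paddle needs attention")
```

A segmentation fault inside the native library cannot be caught from Python, so if this script itself crashes, that already points at the installation rather than at the fine-tuning script.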
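After fine-tuning finished, it suddenly started model compression and then raised an error. How can I fix this?
On Windows, remove the surrounding quotes from 'start_positions' 'end_positions'; then it runs.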
For the error where compression starts after fine-tuning finishes, see https://github.com/PaddlePaddle/PaddleNLP/issues/3700
Check your scipy version; it needs to satisfy scipy<=1.3.1 & scipy>=1.7.3
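To see which versions are actually installed before comparing them with the requirement above, a short standard-library sketch is shown here. It assumes Python 3.8 or newer for importlib.metadata (on Python 3.7 the importlib_metadata backport offers the same functions), and installed_versions is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["scipy", "paddlenlp", "paddlepaddle-gpu"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```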
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.