**Thank you very much for providing the predictor code; I have some questions about problems I hit when running it:**

Open charlieliu9999 opened this issue 3 years ago • 5 comments

1. I trained the model on CBLUE with `python train_spo.py --batch_size 12 --max_seq_length 300 --learning_rate 6e-5 --epochs 3`, running on CPU on a MacBook. The spo_loss dropped from a very large value to just over 100, but spo f1 stays at 0. Is this normal?

global step 2300, epoch: 1, batch: 2300, loss: 223.04558, ent_loss: 114.37089, spo_loss: 108.67469, speed: 0.71 steps/s
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 897/897 [46:15<00:00,  3.09s/it]
eval loss: 249.20973, entity f1: 0.00000, spo f1: 0.00000
[2022-07-27 13:26:15,749] [    INFO] - tokenizer config file saved in ./checkpoint/model_2300/tokenizer_config.json
[2022-07-27 13:26:15,750] [    INFO] - Special tokens file saved in ./checkpoint/model_2300/special_tokens_map.json
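
For readers wondering how the loss can fall while F1 stays at 0, here is a minimal sketch. It assumes the score is an exact-match micro-F1 over (subject, predicate, object) triples; the metric actually used by `train_spo.py` may differ, and the function name and triples below are hypothetical. F1 only counts predictions that match a gold triple exactly, so a model that predicts nothing, or nothing correct, scores 0 however far the loss has dropped.

```python
def micro_f1(gold, pred):
    """Exact-match micro-F1 over per-sentence sets of SPO triples.

    Hypothetical helper for illustration; not taken from the PaddleNLP repo.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # exactly matching triples
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold triples for a single sentence.
gold = [{("disease A", "cause", "obesity"), ("disease A", "symptom", "chest pain")}]

no_predictions = [set()]  # early training: nothing passes the decision threshold
half_right = [{("disease A", "cause", "obesity"), ("disease A", "cause", "smoking")}]

f1_empty = micro_f1(gold, no_predictions)
f1_half = micro_f1(gold, half_right)
f1_perfect = micro_f1(gold, gold)
print(f1_empty, f1_half, f1_perfect)
```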

2. I exported a static-graph model from an intermediate checkpoint and used it for prediction, but the run failed as follows:

python infer_spo.py --device cpu --dataset CMeIE --model_path_prefix ../../cblue/export_CMeIE/inference

[2022-07-27 14:27:25,541] [    INFO] - model_path_prefix   : ../../cblue/export_CMeIE/inference
[2022-07-27 14:27:25,541] [    INFO] - model_name_or_path  : ernie-health-chinese
[2022-07-27 14:27:25,541] [    INFO] - dataset             : CMeIE
[2022-07-27 14:27:25,541] [    INFO] - data_file           : None
[2022-07-27 14:27:25,541] [    INFO] - max_seq_length      : 300
[2022-07-27 14:27:25,541] [    INFO] - use_fp16            : False
[2022-07-27 14:27:25,542] [    INFO] - num_threads         : 4
[2022-07-27 14:27:25,542] [    INFO] - batch_size          : 20
[2022-07-27 14:27:25,542] [    INFO] - device              : cpu
[2022-07-27 14:27:25,542] [    INFO] - device_id           : 0
[2022-07-27 14:27:25,542] [ WARNING] - Can't find the faster_tokenizer package, please ensure install faster_tokenizer correctly. You can install faster_tokenizer by `pip install faster_tokenizer`(Currently only work for linux platform).
[2022-07-27 14:27:25,542] [    INFO] - We are using <class 'paddlenlp.transformers.electra.tokenizer.ElectraTokenizer'> to load 'ernie-health-chinese'.
[2022-07-27 14:27:25,542] [    INFO] - Already cached /Users/lizzysong/.paddlenlp/models/ernie-health-chinese/vocab.txt
[2022-07-27 14:27:25,558] [    INFO] - tokenizer config file saved in /Users/lizzysong/.paddlenlp/models/ernie-health-chinese/tokenizer_config.json
[2022-07-27 14:27:25,558] [    INFO] - Special tokens file saved in /Users/lizzysong/.paddlenlp/models/ernie-health-chinese/special_tokens_map.json
[2022-07-27 14:27:25,558] [    INFO] - >>> [InferBackend] Creating Engine ...
[Paddle2ONNX] Start to parse PaddlePaddle model...
[Paddle2ONNX] Model file path: ../../cblue/export_CMeIE/inference.pdmodel
[Paddle2ONNX] Paramters file path: ../../cblue/export_CMeIE/inference.pdiparams
[Paddle2ONNX] Start to parsing Paddle model...
[Paddle2ONNX] Use opset_version = 13 for ONNX export.
[Paddle2ONNX] PaddlePaddle model is exported as ONNX format now.
[2022-07-27 14:27:31,926] [    INFO] - >>> [InferBackend] Use CPU to inference ...
[2022-07-27 14:27:33,617] [    INFO] - >>> [InferBackend] Engine Created ...
Traceback (most recent call last):
  File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/infer_spo.py", line 69, in <module>
    predictor.predict(input_data)
  File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 320, in predict
    infer_result = self.infer_batch(encoded_inputs)
  File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 151, in infer_batch
    results = self._infer(input_dict)
  File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 140, in _infer
    infer_data = self.inference_backend.infer(input_dict)
  File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 116, in infer
    result = self.predictor.run(None, input_dict)
  File "/Users/lizzysong/opt/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
    raise ValueError("Model requires {} inputs. Input Feed contains {}".format(num_required_inputs, num_inputs))
ValueError: Model requires 4 inputs. Input Feed contains 3

Originally posted by @charlieliu9999 in https://github.com/PaddlePaddle/PaddleNLP/issues/2827#issuecomment-1196326683

charlieliu9999, Jul 28 '22 03:07

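  • Convergence on the CMeIE dataset is indeed slow. In my experience, with 4-GPU training the F1 score only starts to rise from 0 after 5,000 to 6,000 steps, and it takes roughly 100 epochs to reach the F1 reported in the README table.
  • Was the export_model.py you used for prediction pulled from the latest code? The static-graph export implementation there has also been updated, so try exporting again with the latest code and rerunning.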
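
One way to confirm that a stale export is behind `Model requires 4 inputs. Input Feed contains 3` is to compare the input names the model declares with the keys the predictor feeds. The sketch below is only an illustration: `diff_feed` is a hypothetical helper, and the input names are assumptions rather than the names the exported ernie-health model actually uses. With ONNX Runtime, the declared names can be read from `InferenceSession.get_inputs()`, as shown in the comment.

```python
def diff_feed(required_names, feed):
    """Return (missing, unexpected) input names for a feed dict.

    Hypothetical helper for illustration; not part of predictor.py.
    """
    required = list(required_names)
    missing = [name for name in required if name not in feed]
    unexpected = [name for name in feed if name not in required]
    return missing, unexpected


# With ONNX Runtime, the required names could be read from a session:
#   import onnxruntime as ort
#   sess = ort.InferenceSession("model.onnx")  # hypothetical path
#   required = [i.name for i in sess.get_inputs()]
# The names below are assumptions used only to make the example runnable.
required = ["input_ids", "token_type_ids", "position_ids", "attention_mask"]
feed = {
    "input_ids": [[1, 2]],
    "token_type_ids": [[0, 0]],
    "position_ids": [[0, 1]],
}

missing, unexpected = diff_feed(required, feed)
print("missing:", missing, "unexpected:", unexpected)
```

If `missing` is non-empty, the exported graph and the predictor code come from different revisions, which is consistent with the advice to re-export using the latest export_model.py.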

LemonNoel, Jul 28 '22 06:07

After updating export_model.py it runs now. The model has not been trained enough, so the results are poor:

[2022-07-28 16:15:31,769] [ INFO] - >>> [InferBackend] Use CPU to inference ...
[2022-07-28 16:15:33,716] [ INFO] - >>> [InferBackend] Engine Created ...
[2022-07-28 16:15:34,248] [ INFO] - input data: 骶髂关节炎是明确诊断JAS的关键条件。若有肋椎关节病变会使胸部扩张度减小。
[2022-07-28 16:15:34,248] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,248] [ INFO] - -----------------------------
[2022-07-28 16:15:34,249] [ INFO] - input data: 稳定型缺血性心脏疾病@肥胖与缺乏活动也导致高血压增多。
[2022-07-28 16:15:34,249] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,249] [ INFO] - * entity: 稳定型缺血性心脏疾病, position: (0, 9)
[2022-07-28 16:15:34,249] [ INFO] - -----------------------------
[2022-07-28 16:15:34,249] [ INFO] - input data: 呼吸内科 ,071岁,M,因反复咳嗽30年,气促3年,再发伴发热10余天。入院。胸廓桶状胸,肋间隙增宽,语颤减弱,叩诊过清音
[2022-07-28 16:15:34,249] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,249] [ INFO] - * entity: 呼吸内科 ,071岁,M,因反复咳嗽30年,气促3年,再发伴发热10余天。入院。胸廓桶状胸, position: (0, 44)
[2022-07-28 16:15:34,250] [ INFO] - -----------------------------
[2022-07-28 16:15:34,250] [ INFO] - input data: 反复咳嗽、咳痰、活动后气促10年,再发加重3天
[2022-07-28 16:15:34,250] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,250] [ INFO] - -----------------------------
[2022-07-28 16:15:34,250] [ INFO] - input data: 右侧腰痛2年余
[2022-07-28 16:15:34,250] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,250] [ INFO] - -----------------------------

charlieliu9999, Jul 28 '22 08:07

One more question: to extract medical SPO entity relations, would you recommend a model trained with ernie-health, or fine-tuning a uie model? If uie, which gives better results, uie-tiny or uie-medic?

charlieliu9999, Jul 28 '22 08:07

The actual results depend on your data.

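  • ernie-health needs a fairly large amount of training data; CMeIE, for example, has about 14,000 samples. If the schema you want to extract from your data is similar to CMeIE's, you can try mixed training, or try the already fine-tuned model, and see how it performs.
  • uie is more flexible in how SPO schemas are defined. You can call uie-base and uie-medical directly through Taskflow and try each of them. If the results fall short of expectations, I suggest fine-tuning on top of uie-base, which also requires a certain amount of annotated data.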
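
As a rough sketch of the Taskflow route suggested above: UIE takes a relation-extraction schema of the form `{subject type: [relation names]}`, passed to `Taskflow("information_extraction", ...)`. The helper names and schema labels below are hypothetical and should be replaced with the types in your own data. The medical checkpoint is assumed here to be registered as `uie-medical-base`; check the model list of your installed PaddleNLP version.

```python
def build_relation_schema(subject_type, predicates):
    """Build a UIE relation schema: {subject entity type: [relation names]}."""
    return {subject_type: list(predicates)}


def extract(text, schema, model="uie-base"):
    """Run UIE through Taskflow (downloads the model on first use)."""
    # Imported lazily so the schema helper works without PaddleNLP installed.
    from paddlenlp import Taskflow

    ie = Taskflow("information_extraction", schema=schema, model=model)
    return ie(text)


# Hypothetical labels: "疾病" (disease) as subject type, with the relations
# "临床表现" (clinical manifestation) and "病因" (cause).
schema = build_relation_schema("疾病", ["临床表现", "病因"])
print(schema)

# Example usage (not executed here, since it needs PaddleNLP and a model download):
#   text = "稳定型缺血性心脏疾病@肥胖与缺乏活动也导致高血压增多。"
#   for name in ("uie-base", "uie-medical-base"):  # second name is an assumption
#       print(name, extract(text, schema, model=name))
```

Running both models on a handful of your own sentences before annotating anything is a cheap way to decide whether fine-tuning is needed at all.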

LemonNoel, Jul 28 '22 11:07

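The uie model is positioned as a few-shot fine-tuning model. Does it still benefit from a large amount of data, and how much data is appropriate? I trained uie-tiny on a little over 150 annotated samples (data from a single department) and it works to some extent: entity recognition is fairly good, but relation extraction is weaker. For the uie model, does the annotated data need to cover every entity and relation type? Can it infer entities it has never seen during training?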

charlieliu9999, Jul 29 '22 06:07

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot], Dec 08 '22 02:12

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot], Dec 22 '22 16:12