PaddleNLP **Thanks a lot for providing the predictor code. I ran into some problems when running it and would like to ask about them:**

1. I trained the model on CBLUE with `python train_spo.py --batch_size 12 --max_seq_length 300 --learning_rate 6e-5 --epochs 3`, running on CPU on a MacBook. The spo_loss dropped from a very large value to a little over 100, but spo f1 stays at 0. Is this normal?

global step 2300, epoch: 1, batch: 2300, loss: 223.04558, ent_loss: 114.37089, spo_loss: 108.67469, speed: 0.71 steps/s
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 897/897 [46:15<00:00,  3.09s/it]
eval loss: 249.20973, entity f1: 0.00000, spo f1: 0.00000
[2022-07-27 13:26:15,749] [    INFO] - tokenizer config file saved in ./checkpoint/model_2300/tokenizer_config.json
[2022-07-27 13:26:15,750] [    INFO] - Special tokens file saved in ./checkpoint/model_2300/special_tokens_map.json

2. I exported an intermediate checkpoint as a static graph and ran prediction with it, but it fails with the following error:

`python infer_spo.py --device cpu --dataset CMeIE --model_path_prefix ../../cblue/export_CMeIE/inference`

[2022-07-27 14:27:25,541] [    INFO] - model_path_prefix   : ../../cblue/export_CMeIE/inference
[2022-07-27 14:27:25,541] [    INFO] - model_name_or_path  : ernie-health-chinese
[2022-07-27 14:27:25,541] [    INFO] - dataset             : CMeIE
[2022-07-27 14:27:25,541] [    INFO] - data_file           : None
[2022-07-27 14:27:25,541] [    INFO] - max_seq_length      : 300
[2022-07-27 14:27:25,541] [    INFO] - use_fp16            : False
[2022-07-27 14:27:25,542] [    INFO] - num_threads         : 4
[2022-07-27 14:27:25,542] [    INFO] - batch_size          : 20
[2022-07-27 14:27:25,542] [    INFO] - device              : cpu
[2022-07-27 14:27:25,542] [    INFO] - device_id           : 0
[2022-07-27 14:27:25,542] [ WARNING] - Can't find the faster_tokenizer package, please ensure install faster_tokenizer correctly. You can install faster_tokenizer by `pip install faster_tokenizer`(Currently only work for linux platform).
[2022-07-27 14:27:25,542] [    INFO] - We are using <class 'paddlenlp.transformers.electra.tokenizer.ElectraTokenizer'> to load 'ernie-health-chinese'.
[2022-07-27 14:27:25,542] [    INFO] - Already cached /Users/lizzysong/.paddlenlp/models/ernie-health-chinese/vocab.txt
[2022-07-27 14:27:25,558] [    INFO] - tokenizer config file saved in /Users/lizzysong/.paddlenlp/models/ernie-health-chinese/tokenizer_config.json
[2022-07-27 14:27:25,558] [    INFO] - Special tokens file saved in /Users/lizzysong/.paddlenlp/models/ernie-health-chinese/special_tokens_map.json
[2022-07-27 14:27:25,558] [    INFO] - >>> [InferBackend] Creating Engine ...
[Paddle2ONNX] Start to parse PaddlePaddle model...
[Paddle2ONNX] Model file path: ../../cblue/export_CMeIE/inference.pdmodel
[Paddle2ONNX] Paramters file path: ../../cblue/export_CMeIE/inference.pdiparams
[Paddle2ONNX] Start to parsing Paddle model...
[Paddle2ONNX] Use opset_version = 13 for ONNX export.
[Paddle2ONNX] PaddlePaddle model is exported as ONNX format now.
[2022-07-27 14:27:31,926] [    INFO] - >>> [InferBackend] Use CPU to inference ...
[2022-07-27 14:27:33,617] [    INFO] - >>> [InferBackend] Engine Created ...
Traceback (most recent call last):
  File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/infer_spo.py", line 69, in <module>
    predictor.predict(input_data)
  File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 320, in predict
    infer_result = self.infer_batch(encoded_inputs)
  File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 151, in infer_batch
    results = self._infer(input_dict)
  File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 140, in _infer
    infer_data = self.inference_backend.infer(input_dict)
  File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 116, in infer
    result = self.predictor.run(None, input_dict)
  File "/Users/lizzysong/opt/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
    raise ValueError("Model requires {} inputs. Input Feed contains {}".format(num_required_inputs, num_inputs))
ValueError: Model requires 4 inputs. Input Feed contains 3

Originally posted by @charlieliu9999 in https://github.com/PaddlePaddle/PaddleNLP/issues/2827#issuecomment-1196326683

Jul 28 '22 03:07 charlieliu9999

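Convergence on the CMeIE dataset is indeed slow. In my experience with 4-GPU training, the F1 score only starts to rise from 0 after 5,000 to 6,000 steps, and it takes roughly 100 epochs to reach the F1 reported in the README table.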
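
To put those step counts in context, the sketch below converts global steps into the number of training samples consumed. The per-card batch size of 12 for the 4-GPU run is an assumption (the thread only gives the single-device command), so treat the numbers as a rough comparison rather than a measurement.

```python
def samples_seen(global_steps: int, batch_size: int, num_devices: int = 1) -> int:
    """Number of training samples consumed after `global_steps` optimizer steps."""
    return global_steps * batch_size * num_devices


def equivalent_steps(samples: int, batch_size: int, num_devices: int = 1) -> int:
    """Steps another setup needs to consume the same number of samples (rounded up)."""
    per_step = batch_size * num_devices
    return -(-samples // per_step)  # ceiling division


# Assumption: the 4-GPU run also used a batch size of 12 per card.
multi_card = samples_seen(5000, batch_size=12, num_devices=4)
single_cpu = samples_seen(2300, batch_size=12, num_devices=1)

print(multi_card)                                   # 240000
print(single_cpu)                                   # 27600
print(equivalent_steps(multi_card, batch_size=12))  # 20000
```

Under that assumption, step 2300 on a single CPU has seen roughly a ninth of the data the 4-GPU run had seen when F1 started to rise, so an F1 of 0 at that point is not surprising. The effective batch size also differs between the two setups, so the comparison is only approximate.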
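For prediction, was export_model.py taken from the latest code? The static graph export implementation has also been updated, so try exporting again with the latest code and rerunning.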
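
The traceback above reports that the exported model expects 4 inputs while the predictor feeds 3. A quick way to see which input is missing is to compare the model's input names with the keys of the feed dict. The helper below is a minimal sketch, and the input names in it are hypothetical placeholders, not the actual names used by the ernie-health export.

```python
from typing import Iterable, List, Mapping


def missing_inputs(required: Iterable[str], feed: Mapping[str, object]) -> List[str]:
    """Return the model input names that are absent from the feed dict."""
    return [name for name in required if name not in feed]


# Hypothetical names for illustration only. With onnxruntime, the real ones
# can be read from the InferenceSession (the object whose run() appears in
# the traceback):
#   required = [i.name for i in session.get_inputs()]
required = ["input_ids", "token_type_ids", "position_ids", "attention_mask"]
feed = {"input_ids": None, "token_type_ids": None, "position_ids": None}

print(missing_inputs(required, feed))  # ['attention_mask']
```

If the list is non-empty, the exported graph and the predictor code disagree about the model signature, which is consistent with the export script and the predictor coming from different code versions.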

Jul 28 '22 06:07 LemonNoel

After updating export_model.py it runs now. The model is undertrained, so the results are poor:

[2022-07-28 16:15:31,769] [ INFO] - >>> [InferBackend] Use CPU to inference ...
[2022-07-28 16:15:33,716] [ INFO] - >>> [InferBackend] Engine Created ...
[2022-07-28 16:15:34,248] [ INFO] - input data: 骶髂关节炎是明确诊断JAS的关键条件。若有肋椎关节病变会使胸部扩张度减小。
[2022-07-28 16:15:34,248] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,248] [ INFO] - -----------------------------
[2022-07-28 16:15:34,249] [ INFO] - input data: 稳定型缺血性心脏疾病@肥胖与缺乏活动也导致高血压增多。
[2022-07-28 16:15:34,249] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,249] [ INFO] - * entity: 稳定型缺血性心脏疾病, position: (0, 9)
[2022-07-28 16:15:34,249] [ INFO] - -----------------------------
[2022-07-28 16:15:34,249] [ INFO] - input data: 呼吸内科 ,071岁,M,因反复咳嗽30年，气促3年，再发伴发热10余天。入院。胸廓桶状胸，肋间隙增宽，语颤减弱，叩诊过清音
[2022-07-28 16:15:34,249] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,249] [ INFO] - * entity: 呼吸内科 ,071岁,M,因反复咳嗽30年，气促3年，再发伴发热10余天。入院。胸廓桶状胸, position: (0, 44)
[2022-07-28 16:15:34,250] [ INFO] - -----------------------------
[2022-07-28 16:15:34,250] [ INFO] - input data: 反复咳嗽、咳痰、活动后气促10年，再发加重3天
[2022-07-28 16:15:34,250] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,250] [ INFO] - -----------------------------
[2022-07-28 16:15:34,250] [ INFO] - input data: 右侧腰痛2年余
[2022-07-28 16:15:34,250] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,250] [ INFO] - -----------------------------

Jul 28 '22 08:07 charlieliu9999

One more question: for extracting medical SPO entity relations, would you recommend a model trained with ernie-health, or fine-tuning a UIE model? If UIE, which works better, uie-tiny or uie-medical?

Jul 28 '22 08:07 charlieliu9999

The actual results depend on your data.

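ernie-health needs a fairly large amount of training data; CMeIE, for example, has about 14,000 examples. If the schema of your data is similar to the one extracted in CMeIE, you can try mixed training, or try the already fine-tuned model, to see how it performs.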
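UIE is more flexible in how SPO schemas are defined. You can call uie-base and uie-medical directly through Taskflow and compare the results. If they fall short of expectations, I recommend fine-tuning on uie-base, which also requires a certain amount of labeled data.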
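
A minimal sketch of the Taskflow route is shown below. The schema (a "疾病" subject with two relations) is a hypothetical example, not one taken from CMeIE, and the exact identifier of the medical model should be checked against the Taskflow documentation before use.

```python
from typing import Dict, List


def build_relation_schema(subject_type: str, relations: List[str]) -> Dict[str, List[str]]:
    """Build a UIE relation-extraction schema: {subject type: [relation names]}."""
    return {subject_type: list(relations)}


def extract(text: str, schema, model: str = "uie-base"):
    """Run UIE through Taskflow. Requires paddlenlp; downloads the model on first use."""
    from paddlenlp import Taskflow  # imported lazily so the helper above works without it

    ie = Taskflow("information_extraction", schema=schema, model=model)
    return ie(text)


# Hypothetical schema: a "疾病" (disease) subject with two example relations.
schema = build_relation_schema("疾病", ["临床表现", "药物治疗"])
print(schema)
```

Calling `extract(text, schema)` once with the default model and once with the medical model on the same handful of sentences gives a quick side-by-side comparison before deciding whether fine-tuning is needed.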

Jul 28 '22 11:07 LemonNoel

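UIE is described as a model for few-shot fine-tuning. Does a large amount of data still help, and how much data is appropriate? I trained uie-tiny on about 150 labeled examples (data from a single department) and it works to some extent: entity recognition is fairly good, relation extraction is weaker. With UIE, does the labeled data need to cover all entity and relation types? Can the model infer entities it has not seen during training?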

Jul 29 '22 06:07 charlieliu9999

This issue is stale because it has been open for 60 days with no activity.

Dec 08 '22 02:12 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

Dec 22 '22 16:12 github-actions[bot]