**Thanks a lot for providing the predictor code. I ran into some problems when running it and would like to ask about them:**
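1. Training the model on CBLUE. I ran
python train_spo.py --batch_size 12 --max_seq_length 300 --learning_rate 6e-5 --epochs 3
to train the model on a MacBook, using the CPU.
The spo_loss dropped from a very large value to just over 100, but spo f1 stays at 0. Is this normal?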
global step 2300, epoch: 1, batch: 2300, loss: 223.04558, ent_loss: 114.37089, spo_loss: 108.67469, speed: 0.71 steps/s
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 897/897 [46:15<00:00, 3.09s/it]
eval loss: 249.20973, entity f1: 0.00000, spo f1: 0.00000
[2022-07-27 13:26:15,749] [ INFO] - tokenizer config file saved in ./checkpoint/model_2300/tokenizer_config.json
[2022-07-27 13:26:15,750] [ INFO] - Special tokens file saved in ./checkpoint/model_2300/special_tokens_map.json
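For readers puzzling over an F1 that stays at exactly 0 while the loss keeps falling, here is a minimal sketch of a micro-averaged, exact-match F1 over SPO triples. It is an illustration under assumptions, not the metric code used by train_spo.py: the function name `micro_f1` and the toy triples are hypothetical. The point it shows is that F1 is 0 until at least one predicted triple matches a gold triple exactly, regardless of how much the loss has dropped.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def micro_f1(gold: List[Set[Triple]], pred: List[Set[Triple]]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1), counting exact triple matches over all sentences."""
    correct = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: one sentence with two gold triples.
gold = [{("disease_a", "symptom", "cough"), ("disease_a", "drug", "drug_x")}]
no_pred = [set()]  # an undertrained model that predicts nothing
partial = [{("disease_a", "symptom", "cough"), ("disease_a", "symptom", "fever")}]

print(micro_f1(gold, no_pred))
print(micro_f1(gold, partial))
```

With no predictions at all, every count is zero and the score is 0; one correct triple out of two predicted and two gold gives 0.5 for precision, recall and F1.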
2. I exported an intermediate checkpoint to a static graph and ran prediction with it, but it failed as follows:
python infer_spo.py --device cpu --dataset CMeIE --model_path_prefix ../../cblue/export_CMeIE/inference
[2022-07-27 14:27:25,541] [ INFO] - model_path_prefix : ../../cblue/export_CMeIE/inference
[2022-07-27 14:27:25,541] [ INFO] - model_name_or_path : ernie-health-chinese
[2022-07-27 14:27:25,541] [ INFO] - dataset : CMeIE
[2022-07-27 14:27:25,541] [ INFO] - data_file : None
[2022-07-27 14:27:25,541] [ INFO] - max_seq_length : 300
[2022-07-27 14:27:25,541] [ INFO] - use_fp16 : False
[2022-07-27 14:27:25,542] [ INFO] - num_threads : 4
[2022-07-27 14:27:25,542] [ INFO] - batch_size : 20
[2022-07-27 14:27:25,542] [ INFO] - device : cpu
[2022-07-27 14:27:25,542] [ INFO] - device_id : 0
[2022-07-27 14:27:25,542] [ WARNING] - Can't find the faster_tokenizer package, please ensure install faster_tokenizer correctly. You can install faster_tokenizer by `pip install faster_tokenizer`(Currently only work for linux platform).
[2022-07-27 14:27:25,542] [ INFO] - We are using <class 'paddlenlp.transformers.electra.tokenizer.ElectraTokenizer'> to load 'ernie-health-chinese'.
[2022-07-27 14:27:25,542] [ INFO] - Already cached /Users/lizzysong/.paddlenlp/models/ernie-health-chinese/vocab.txt
[2022-07-27 14:27:25,558] [ INFO] - tokenizer config file saved in /Users/lizzysong/.paddlenlp/models/ernie-health-chinese/tokenizer_config.json
[2022-07-27 14:27:25,558] [ INFO] - Special tokens file saved in /Users/lizzysong/.paddlenlp/models/ernie-health-chinese/special_tokens_map.json
[2022-07-27 14:27:25,558] [ INFO] - >>> [InferBackend] Creating Engine ...
[Paddle2ONNX] Start to parse PaddlePaddle model...
[Paddle2ONNX] Model file path: ../../cblue/export_CMeIE/inference.pdmodel
[Paddle2ONNX] Paramters file path: ../../cblue/export_CMeIE/inference.pdiparams
[Paddle2ONNX] Start to parsing Paddle model...
[Paddle2ONNX] Use opset_version = 13 for ONNX export.
[Paddle2ONNX] PaddlePaddle model is exported as ONNX format now.
[2022-07-27 14:27:31,926] [ INFO] - >>> [InferBackend] Use CPU to inference ...
[2022-07-27 14:27:33,617] [ INFO] - >>> [InferBackend] Engine Created ...
Traceback (most recent call last):
File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/infer_spo.py", line 69, in <module>
predictor.predict(input_data)
File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 320, in predict
infer_result = self.infer_batch(encoded_inputs)
File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 151, in infer_batch
results = self._infer(input_dict)
File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 140, in _infer
infer_data = self.inference_backend.infer(input_dict)
File "/Users/lizzysong/PaddleNLP/model_zoo/ernie-health/deploy/predictor/predictor.py", line 116, in infer
result = self.predictor.run(None, input_dict)
File "/Users/lizzysong/opt/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
raise ValueError("Model requires {} inputs. Input Feed contains {}".format(num_required_inputs, num_inputs))
ValueError: Model requires 4 inputs. Input Feed contains 3
Originally posted by @charlieliu9999 in https://github.com/PaddlePaddle/PaddleNLP/issues/2827#issuecomment-1196326683
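- Convergence on the CMeIE dataset is fairly slow. In my experience, with 4-GPU training the F1 score only starts to rise from 0 after 5000~6000 steps, and it takes roughly 100 epochs to reach the F1 values given in the README table.
- Is the export_model.py you used for prediction from the latest code? The static graph export implementation has also been updated, so you could export again with the latest code and retry.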
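A mismatch like "Model requires 4 inputs. Input Feed contains 3" can be narrowed down by comparing the input names the exported graph declares with the keys of the feed dict. The sketch below is only a debugging aid under assumptions: `diff_inputs` is a hypothetical helper, and the input names in the example are placeholders, not the actual inputs of the exported model. With onnxruntime, the declared names can be read from the session via `get_inputs()`.

```python
from typing import Dict, Iterable, List, Tuple


def diff_inputs(required: Iterable[str], feed: Dict[str, object]) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) input names, each sorted alphabetically."""
    required_set = set(required)
    feed_set = set(feed)
    return sorted(required_set - feed_set), sorted(feed_set - required_set)


# With onnxruntime, the declared input names come from the session object,
# e.g. the one that predictor.py calls .run() on in the traceback:
#   required = [node.name for node in session.get_inputs()]
# The names below are placeholders for illustration only.
required = ["input_ids", "token_type_ids", "position_ids", "attention_mask"]
feed = {"input_ids": None, "token_type_ids": None, "position_ids": None}

missing, unexpected = diff_inputs(required, feed)
print("missing:", missing)
print("unexpected:", unexpected)
```

If the missing name only exists in an older export, re-exporting with the current export_model.py, as suggested above, is the simpler fix than patching the feed.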
After updating export_model.py it runs now. The model is undertrained, so the results are poor:
[2022-07-28 16:15:31,769] [ INFO] - >>> [InferBackend] Use CPU to inference ...
[2022-07-28 16:15:33,716] [ INFO] - >>> [InferBackend] Engine Created ...
[2022-07-28 16:15:34,248] [ INFO] - input data: 骶髂关节炎是明确诊断JAS的关键条件。若有肋椎关节病变会使胸部扩张度减小。
[2022-07-28 16:15:34,248] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,248] [ INFO] - -----------------------------
[2022-07-28 16:15:34,249] [ INFO] - input data: 稳定型缺血性心脏疾病@肥胖与缺乏活动也导致高血压增多。
[2022-07-28 16:15:34,249] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,249] [ INFO] - * entity: 稳定型缺血性心脏疾病, position: (0, 9)
[2022-07-28 16:15:34,249] [ INFO] - -----------------------------
[2022-07-28 16:15:34,249] [ INFO] - input data: 呼吸内科 ,071岁,M,因反复咳嗽30年,气促3年,再发伴发热10余天。入院。胸廓桶状胸,肋间隙增宽,语颤减弱,叩诊过清音
[2022-07-28 16:15:34,249] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,249] [ INFO] - * entity: 呼吸内科 ,071岁,M,因反复咳嗽30年,气促3年,再发伴发热10余天。入院。胸廓桶状胸, position: (0, 44)
[2022-07-28 16:15:34,250] [ INFO] - -----------------------------
[2022-07-28 16:15:34,250] [ INFO] - input data: 反复咳嗽、咳痰、活动后气促10年,再发加重3天
[2022-07-28 16:15:34,250] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,250] [ INFO] - -----------------------------
[2022-07-28 16:15:34,250] [ INFO] - input data: 右侧腰痛2年余
[2022-07-28 16:15:34,250] [ INFO] - detected entities and relations:
[2022-07-28 16:15:34,250] [ INFO] - -----------------------------
One more question: for extracting medical SPO entity relations, would you recommend a model trained with ernie-health, or fine-tuning a uie model? If uie, which works better, uie-tiny or uie-medic?
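The actual results depend on your data.
- ernie-health needs a fairly large amount of training data; CMeIE, for example, has about 14,000 examples. If the schema you need to extract is similar to CMeIE's, you could try mixed training, or start from the already fine-tuned model, and see how it performs.
- uie is more flexible in how the SPO schema is designed. You can call it directly through taskflow and try uie-base and uie-medical in turn. If the results fall short of expectations, I suggest fine-tuning on uie-base, which also requires a certain amount of annotated data.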
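To make the taskflow suggestion concrete, here is a rough sketch of calling UIE for relation extraction through PaddleNLP's `Taskflow`. The helper names `build_relation_schema` and `extract` and the schema labels are hypothetical, chosen only for illustration and not taken from CMeIE. The schema shape (a dict mapping a subject type to a list of predicates) follows the Taskflow information extraction usage as I understand it.

```python
from typing import Dict, Iterable, List, Tuple


def build_relation_schema(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (subject_type, predicate) pairs into {subject_type: [predicates]}."""
    schema: Dict[str, List[str]] = {}
    for subject_type, predicate in pairs:
        predicates = schema.setdefault(subject_type, [])
        if predicate not in predicates:  # skip duplicates, keep first-seen order
            predicates.append(predicate)
    return schema


def extract(texts: List[str], schema: Dict[str, List[str]], model: str = "uie-base"):
    """Run extraction with a UIE checkpoint; needs paddlenlp installed."""
    # Imported lazily so the schema helper can be used without paddlenlp.
    from paddlenlp import Taskflow

    ie = Taskflow("information_extraction", schema=schema, model=model)
    return ie(texts)


# Hypothetical labels for illustration only.
schema = build_relation_schema(
    [("疾病", "临床表现"), ("疾病", "药物治疗"), ("疾病", "临床表现")]
)
print(schema)
```

Calling `extract(texts, schema)` and then again with `model=` set to the medical checkpoint gives a side-by-side comparison on your own sentences. The medical model is referred to as uie-medical in this thread; the exact model name accepted by `Taskflow` should be checked against the PaddleNLP documentation.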
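The uie model is described as being meant for few-shot fine-tuning. Does using a large amount of data still help, and how much data is appropriate? I trained uie-tiny on a little over 150 annotated examples (data from a single department) and it works to some extent: entity recognition is fairly good, relation extraction is weaker. For uie models, does the annotated data need to cover all entities and relations? Can the model infer entities it has not seen during training?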
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.