PaddleNLP [Question]: Why can't Taskflow 'information_extraction' extract the values for uppercase English keys?

Please ask your question

While extracting the contents of English invoices, OCR already recognizes all of the invoice information correctly, for example: Invoice Nr: 000000000000264 Sum 70435.20 A TOTAL 70435.20

Next I use 'information_extraction':

    from paddlenlp import Taskflow
    schema = ["Invoice Nr", "Sum", "TOTAL"]
    ie = Taskflow('information_extraction', schema_lang="en", ocr_lang="en", schema=schema)

The result only contains the keys written with lowercase letters; nothing at all is extracted for the uppercase one:

    [2024-02-01 15:20:20,191] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'C:\Users\great.paddlenlp\taskflow\information_extraction\uie-base'.
    [{'Invoice Nr': [{'end': 123, 'probability': 0.8296216565117902, 'start': 108, 'text': '000000000000264'}], 'Sum': [{'end': 255, 'probability': 0.5992708411511529, 'start': 247, 'text': '70435.20'}]}]

Process finished with exit code 0

How can I solve this?

Feb 01 '24 07:02 greatliu
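
The snippet in the question builds the Taskflow but does not show the call on the invoice text. The following is a minimal sketch of the full round trip, assuming the OCR output is passed in as a plain string. `extract` and `missing_keys` are hypothetical helper names introduced here for illustration; only `Taskflow`, `schema` and `schema_lang` come from the question. The helper makes the symptom explicit by listing the schema keys that came back empty.

```python
from typing import Any, Dict, List


def missing_keys(schema: List[str], results: List[Dict[str, Any]]) -> List[str]:
    """Return the schema keys that have no extracted span in any result dict."""
    found = {key for result in results for key, spans in result.items() if spans}
    return [key for key in schema if key not in found]


def extract(text: str, schema: List[str]) -> List[Dict[str, Any]]:
    """Run UIE on plain text. paddlenlp is imported lazily so the helper above
    can be used without it (hypothetical wrapper, not part of PaddleNLP)."""
    from paddlenlp import Taskflow

    ie = Taskflow("information_extraction", schema=schema, schema_lang="en")
    return ie(text)


# Result copied from the log in the question
reported = [{
    "Invoice Nr": [{"text": "000000000000264", "start": 108, "end": 123,
                    "probability": 0.8296216565117902}],
    "Sum": [{"text": "70435.20", "start": 247, "end": 255,
             "probability": 0.5992708411511529}],
}]

print(missing_keys(["Invoice Nr", "Sum", "TOTAL"], reported))  # ['TOTAL']
```

Running `missing_keys` after every call gives a quick way to collect the invoices on which a key fails, which is also a natural source of examples to annotate.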

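If you are sure that the OCR output is correct, it looks like the results need to be improved through data annotation. The documentation describes the annotation workflow: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/information_extraction/label_studio_doc.md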

Feb 06 '24 05:02 wawltor

Got it, I will study that.

Feb 07 '24 01:02 greatliu

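I have finished reading the data annotation guide. Does this mean I should annotate some data and train the model myself? If so, which documents do I need to read next, up to the point where the trained model can be called?

I am just getting started and would appreciate the guidance. Thanks.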

Feb 17 '24 12:02 greatliu

This issue is stale because it has been open for 60 days with no activity.

Apr 27 '24 00:04 github-actions[bot]

See this overview document: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/applications/information_extraction/README.md

May 10 '24 12:05 w5688414
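
To make the annotate-then-train path more concrete, here is a rough sketch of what one training example might look like once the annotations are converted. It assumes the JSON-lines layout with `content`, `result_list` and `prompt` fields, which is how I recall the README describing the fine-tuning data; please verify against the README before relying on it. `build_record` is a hypothetical helper, not a PaddleNLP function.

```python
import json
from typing import Any, Dict


def build_record(content: str, prompt: str, value: str) -> Dict[str, Any]:
    """Build one training example: `prompt` is the schema key and `value`
    is the answer span, located by its first occurrence in `content`."""
    start = content.find(value)
    if start < 0:
        raise ValueError(f"{value!r} does not occur in the content")
    return {
        "content": content,
        "result_list": [{"text": value, "start": start, "end": start + len(value)}],
        "prompt": prompt,
    }


# Fragment of the OCR text from the question, used as a toy example
record = build_record("A TOTAL 70435.20", "TOTAL", "70435.20")
print(json.dumps(record, ensure_ascii=False))
```

Locating the span with `find` only works when the value occurs once. In the invoice above, 70435.20 appears under both Sum and TOTAL, which is one reason to mark the spans in the annotation tool rather than derive them automatically. After fine-tuning, the README loads the trained checkpoint by passing its directory to `Taskflow` through the `task_path` argument.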

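Thanks for the answer. As it happens, I need to extract data from Chinese invoices, and UIE already extracts most of the invoice fields. Could you help with a few smaller questions?
1. Several keys share the same name. For example, both the buyer and the seller have a “名称” (name) field, and in my tests only one fixed occurrence is extracted.
2. Can vertically written text be extracted? For example, “购买方信息” (buyer information) and “销售方信息” (seller information) are both written vertically.
3. How do I extract keys that contain spaces in the middle? For example, “金额” (amount) cannot be extracted.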

May 11 '24 04:05 greatliu
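
Two of these questions can at least be experimented with in a few lines; the sketch below is an assumption-laden starting point, not a confirmed fix. For the duplicate “名称” key, UIE schemas can be nested, so asking for “名称” under “购买方” and under “销售方” separately is worth trying, although whether the pretrained model resolves it without fine-tuning is untested here. For keys printed with spaces between the characters, one option is to collapse whitespace between Chinese characters in the OCR text before extraction. `collapse_cjk_spaces` is a hypothetical helper, and the spaced-out sample string is made up for illustration.

```python
import re

# Whitespace that sits directly between two CJK ideographs
_CJK_GAP = re.compile(r"(?<=[\u4e00-\u9fff])\s+(?=[\u4e00-\u9fff])")


def collapse_cjk_spaces(text: str) -> str:
    """Remove whitespace between adjacent Chinese characters; spacing next to
    digits, Latin letters and punctuation is left untouched."""
    return _CJK_GAP.sub("", text)


# Nested schema: ask for the name relative to each party (assumed to help,
# not verified on real invoices)
nested_schema = [{"购买方": ["名称"]}, {"销售方": ["名称"]}]

# Made-up OCR fragment with a spaced-out field label
print(collapse_cjk_spaces("金 额 100.00"))  # 金额 100.00
```

The normalisation is blunt: it also glues together two separate Chinese fields that OCR happened to separate only by a space, so it is safer to apply it per OCR line or text box than to the whole page.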

I recommend an LLM + OCR solution: https://aistudio.baidu.com/application/detail/7658

May 11 '24 06:05 w5688414

Isn't the link you gave pure OCR?

May 11 '24 08:05 greatliu

Feed the OCR result into the ERNIE Bot (文心一言) large language model and use prompt engineering to do the extraction.

May 11 '24 09:05 w5688414
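
As a rough illustration of this suggestion, the sketch below builds an extraction prompt from the OCR text and the wanted keys, and parses a JSON reply. The prompt wording and both function names are assumptions made for this example. The call to the model itself is deliberately left out, because the thread does not say which client or API is used.

```python
import json
from typing import Any, Dict, List


def build_prompt(ocr_text: str, keys: List[str]) -> str:
    """Compose an instruction asking the model for a JSON object of the fields."""
    key_list = ", ".join(f'"{k}"' for k in keys)
    return (
        "Extract the following fields from the invoice text below.\n"
        f"Fields: {key_list}\n"
        "Answer with a single JSON object that maps each field to its value, "
        "or to null if the field is absent.\n\n"
        f"Invoice text:\n{ocr_text}"
    )


def parse_answer(answer: str, keys: List[str]) -> Dict[str, Any]:
    """Parse the outermost {...} span of the reply and keep only the requested
    keys; anything unparsable yields None for every key."""
    start, end = answer.find("{"), answer.rfind("}")
    if start < 0 or end < start:
        return {k: None for k in keys}
    try:
        data = json.loads(answer[start:end + 1])
    except json.JSONDecodeError:
        return {k: None for k in keys}
    return {k: data.get(k) for k in keys}


keys = ["Invoice Nr", "Sum", "TOTAL"]
prompt = build_prompt("Invoice Nr: 000000000000264 Sum 70435.20 A TOTAL 70435.20", keys)
# The reply below is a hand-written stand-in for a model response
print(parse_answer('Result: {"TOTAL": "70435.20", "Sum": "70435.20"}', keys))
```

Restricting the parsed result to the requested keys keeps stray fields in the model's reply from leaking into downstream code.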

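I am planning to integrate this feature into an MIS system, so it has to be deployed locally, and I would rather avoid a large model because it consumes too many resources.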

May 12 '24 00:05 greatliu

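I see. If you only have a CPU, a small model is the better fit; alternatively, deploy the large model in the cloud and integrate it through an API.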

May 12 '24 07:05 w5688414

This issue is stale because it has been open for 60 days with no activity.

Jul 12 '24 00:07 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

Jul 27 '24 00:07 github-actions[bot]
