Testing a PDF file: num_nodes, num_edges, and num_subgraphs are all 0. Am I doing something wrong?

Open liyubo-debug opened this issue 10 months ago • 10 comments

The documents in the provided examples are in json, csv, and txt format. I would like to test a PDF; how should I go about it?

liyubo-debug avatar Feb 25 '25 07:02 liyubo-debug

https://github.com/NanGePlus/KagTest/tree/main/KagV6Test/XiYouJiTest_KAG_V6 The example at this link includes PDF files.

zzyyll2 avatar Feb 25 '25 10:02 zzyyll2

Thanks for the pointer!

liyubo-debug avatar Feb 25 '25 11:02 liyubo-debug

I tested with the example from the link above and found that the PDF file gives "graph_stat": {"num_nodes": 0, "num_edges": 0, "num_subgraphs": 0}, whereas the docx and md files produce non-zero values for these fields. Why is it 0 for the PDF?

In addition, when I uploaded my own PDF for testing, I got the following errors:

INFO:kag.interface.common.llm_client:Error 'name' during invocation: Traceback (most recent call last):
  File "/home/xxx/project/KAG/kag/interface/common/llm_client.py", line 110, in invoke
    result = prompt_op.parse_response(response, model=self.model, **variables)
  File "/home/xxx/project/KAG/kag/builder/prompt/default/std.py", line 134, in parse_response
    entities_with_offical_name.add(entity["name"])
**KeyError: 'name'**

Traceback (most recent call last):
  File "/home/xxx/project/KAG/kag/builder/runner.py", line 146, in process
    result = self.chain.invoke(
  File "/home/xxx/project/KAG/kag/builder/default_chain.py", line 190, in invoke
    ret = inner_future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/xxx/project/KAG/kag/builder/default_chain.py", line 157, in run_extract
    flow_data = execute_node(node, flow_data, key=input_key)
  File "/home/xxx/project/KAG/kag/builder/default_chain.py", line 143, in execute_node
    node_output.extend(node.invoke(item, **kwargs))
  File "/home/xxx/project/KAG/kag/builder/component/writer/kg_writer.py", line 120, in invoke
    input = self.standarlize_graph(input)
  File "/home/xxx/project/KAG/kag/builder/component/writer/kg_writer.py", line 91, in standarlize_graph
    node.properties[k] = json.dumps(v, ensure_ascii=False)
  File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
**TypeError: Object of type Chunk is not JSON serializable**

Judging from the error, an object of type Chunk cannot be JSON-serialized. My PDF contains images and tables. Are those unsupported, or is something else the cause? Any pointers would be appreciated!

liyubo-debug avatar Feb 27 '25 06:02 liyubo-debug
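
The two exceptions in the tracebacks above can be reproduced and guarded against in isolation. What follows is a minimal sketch, not KAG's actual code and not an official fix. `Chunk` is a hypothetical stand-in dataclass (the real class has different fields), and `collect_names` and `dumps_with_fallback` are illustrative helper names. It assumes the LLM response sometimes contains entity entries without a `name` key, and that a property value may be an object the standard `json` module cannot encode.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for KAG's Chunk; the real class differs."""
    id: str
    content: str


def collect_names(entities):
    """Collect entity names, skipping malformed entries instead of raising KeyError."""
    names = set()
    for entity in entities:
        # entity["name"] would raise KeyError when the key is missing.
        name = entity.get("name") if isinstance(entity, dict) else None
        if name:
            names.add(name)
    return names


def _fallback(obj):
    """Called by json.dumps only for objects it cannot encode natively."""
    if is_dataclass(obj) and not isinstance(obj, type):
        return asdict(obj)  # dataclass instance -> plain dict
    return str(obj)  # last resort: string representation


def dumps_with_fallback(value):
    """Like json.dumps(value, ensure_ascii=False), but tolerant of unknown types."""
    return json.dumps(value, ensure_ascii=False, default=_fallback)
```

Skipping malformed entities hides the symptom rather than the cause: if the model regularly omits `name`, the extraction prompt or the model itself probably deserves a closer look.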

You can use MinerU or marker to convert the PDF to Markdown, and then load the md file into KAG.

caszkgui avatar Feb 28 '25 01:02 caszkgui

I ran into the same problem: when reading a PDF file, nodes and edges are both 0.

yzhbreeze avatar Feb 28 '25 09:02 yzhbreeze

Parsing txt works, but with a large PDF either the vectorization or the chunking is slow, and no result ever comes out.

Dongyexixue avatar Feb 28 '25 13:02 Dongyexixue

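Yes, parsing the PDF was so slow that it never succeeded. I converted the PDF to md: some md files yield non-zero nodes and edges, while others still give 0. Converting further to txt does parse, but the images and tables are lost in the txt file.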

liyubo-debug avatar Mar 03 '25 01:03 liyubo-debug

How good is the extraction after converting the PDF to TXT? Could you share a screenshot of your results?

zhulin-acad avatar Mar 06 '25 05:03 zhulin-acad

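When MinerU extracts tables, it saves them in HTML format, which still needs further processing. Some tables cannot be extracted at all, though, and there is nothing to be done about those.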

yzhbreeze avatar Mar 06 '25 06:03 yzhbreeze
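
One way to post-process HTML tables like those mentioned above is to turn them into Markdown pipe tables before loading the md file. The sketch below uses only Python's standard `html.parser`. It is an assumption-laden illustration: `TableParser` and `html_table_to_markdown` are hypothetical names, the input is assumed to be a simple `<table>` with explicit closing tags, and `rowspan`, `colspan`, and nested tables are ignored. It does not rely on any MinerU API.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell text of every <tr> in the input into self.rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so the cell stays on one line.
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_markdown(html):
    """Convert a simple HTML table to a Markdown pipe table (first row = header)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    width = max(len(row) for row in parser.rows)
    rows = [row + [""] * (width - len(row)) for row in parser.rows]  # pad short rows
    lines = ["| " + " | ".join(rows[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)
```

Tables with merged cells would need a real HTML library or manual cleanup; this sketch only covers the plain rectangular case.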

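KAG's core strength is in plan/reasoner; it is not good at file parsing. For a production environment, I suggest using a third-party service such as the WPS API for file format conversion. Parsing is time-consuming and laborious, so I don't recommend reinventing the wheel.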

thundax-lyp avatar Mar 06 '25 06:03 thundax-lyp