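Testing a PDF file with KAG, I found that num_nodes, num_edges, and num_subgraphs are all 0. Am I doing something wrong?

The provided examples use json, csv, and txt documents. I would like to test a PDF. How should I go about it?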

Feb 25 '25 07:02 liyubo-debug

https://github.com/NanGePlus/KagTest/tree/main/KagV6Test/XiYouJiTest_KAG_V6 This example includes a PDF file.

Feb 25 '25 10:02 zzyyll2

Thanks for the pointer!

Feb 25 '25 11:02 liyubo-debug

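Following the example at the address above, I found that the PDF gives "graph_stat": {"num_nodes": 0, "num_edges": 0, "num_subgraphs": 0}, while the docx and md files have non-zero values for these fields. Why is the PDF 0?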

In addition, when I uploaded my own PDF for testing, I got the following error:
INFO:kag.interface.common.llm_client:Error 'name' during invocation:
Traceback (most recent call last):
  File "/home/xxx/project/KAG/kag/interface/common/llm_client.py", line 110, in invoke
    result = prompt_op.parse_response(response, model=self.model, **variables)
  File "/home/xxx/project/KAG/kag/builder/prompt/default/std.py", line 134, in parse_response
    entities_with_offical_name.add(entity["name"])
**KeyError: 'name'**
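
For context, `KeyError: 'name'` is what Python raises when a dict parsed from the model's response has no `name` key. A minimal sketch of tolerant handling follows. It is not KAG's actual `parse_response`; the function name `collect_official_names` and the sample data are made up for illustration.

```python
from typing import Any, Iterable


def collect_official_names(entities: Iterable[Any]) -> set[str]:
    """Collect entity names, skipping items that lack a usable 'name'."""
    names: set[str] = set()
    for entity in entities:
        # LLM output is untrusted: an item may not even be a dict.
        if not isinstance(entity, dict):
            continue
        name = entity.get("name")  # .get() returns None instead of raising KeyError
        if isinstance(name, str) and name.strip():
            names.add(name.strip())
    return names


# Hypothetical LLM output: the second item has no "name" key.
sample = [
    {"name": "Sun Wukong", "category": "Person"},
    {"category": "Person", "official_name": "Tang Sanzang"},
    "not a dict",
]
print(collect_official_names(sample))
```

Skipping malformed items keeps the run alive, at the cost of silently dropping entities, so logging the skipped items is probably worthwhile.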

Traceback (most recent call last):
  File "/home/xxx/project/KAG/kag/builder/runner.py", line 146, in process
    result = self.chain.invoke(
  File "/home/xxx/project/KAG/kag/builder/default_chain.py", line 190, in invoke
    ret = inner_future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/xxx/project/KAG/kag/builder/default_chain.py", line 157, in run_extract
    flow_data = execute_node(node, flow_data, key=input_key)
  File "/home/xxx/project/KAG/kag/builder/default_chain.py", line 143, in execute_node
    node_output.extend(node.invoke(item, **kwargs))
  File "/home/xxx/project/KAG/kag/builder/component/writer/kg_writer.py", line 120, in invoke
    input = self.standarlize_graph(input)
  File "/home/xxx/project/KAG/kag/builder/component/writer/kg_writer.py", line 91, in standarlize_graph
    node.properties[k] = json.dumps(v, ensure_ascii=False)
  File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
**TypeError: Object of type Chunk is not JSON serializable**

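Judging from the error, an object of type Chunk cannot be JSON-serialized. My PDF contains images and tables. Is that unsupported, or is there some other cause? Any pointers would be appreciated!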
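
The second traceback shows `json.dumps` being handed a `Chunk` object inside a node property. The sketch below reproduces that class of error outside KAG and shows one generic way around it, the `default=` hook of `json.dumps`. The `Chunk` dataclass here is a stand-in, not KAG's class, and whether patching `kg_writer.py` along these lines is appropriate is an open question.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk object; not KAG's real Chunk class."""
    id: str
    content: str


def to_jsonable(obj):
    """Fallback for json.dumps: called only for objects it cannot encode."""
    if is_dataclass(obj) and not isinstance(obj, type):
        return asdict(obj)
    raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")


value = {"source": Chunk(id="c1", content="table 1")}

# Without a fallback, json.dumps raises the same TypeError as in the traceback.
try:
    json.dumps(value, ensure_ascii=False)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# With default=, the dataclass is converted to a plain dict first.
encoded = json.dumps(value, ensure_ascii=False, default=to_jsonable)
print(encoded)
```

This only explains the mechanics of the error. It does not tell us why a `Chunk` ended up in the node properties for this particular PDF.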

Feb 27 '25 06:02 liyubo-debug

You can use MinerU or marker to convert the PDF into Markdown format, and then load the md file into KAG.

Feb 28 '25 01:02 caszkgui

I ran into the same problem: when reading a PDF file, nodes and edges are both 0.

Feb 28 '25 09:02 yzhbreeze

Parsing txt works, but for large PDFs either vectorization or chunking is slow, and no result ever comes out.

Feb 28 '25 13:02 Dongyexixue

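Yes, parsing the PDF was extremely slow and never succeeded. I converted the PDF to md; some md files then had non-zero nodes and edges, while others were still 0. Converting further to txt does parse, but the txt version loses the images and tables.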
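
One thing worth ruling out when some converted md files still yield zero nodes is that the conversion itself produced little extractable prose, for example a scanned PDF that came out as mostly image links. That is a guess, not a confirmed cause. The sketch below, using only the standard library, counts rough indicators per file; the function names and the sample text are illustrative.

```python
import re
from pathlib import Path

IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")  # Markdown image: ![alt](path)
HTML_TABLE_RE = re.compile(r"<table\b", re.IGNORECASE)
PIPE_ROW_RE = re.compile(r"^[ \t]*\|.*\|[ \t]*$", re.MULTILINE)


def summarize_markdown(text: str) -> dict:
    """Count rough indicators of what a converted Markdown file contains."""
    images = len(IMAGE_RE.findall(text))
    html_tables = len(HTML_TABLE_RE.findall(text))
    pipe_rows = len(PIPE_ROW_RE.findall(text))
    # Strip image links and HTML tags to estimate how much plain text is left.
    prose = IMAGE_RE.sub("", text)
    prose = re.sub(r"<[^>]+>", "", prose)
    prose_chars = len(re.sub(r"\s+", "", prose))
    return {
        "prose_chars": prose_chars,
        "images": images,
        "html_tables": html_tables,
        "pipe_table_rows": pipe_rows,
    }


def summarize_dir(md_dir: str) -> dict:
    """Summarize every .md file under md_dir (path supplied by the caller)."""
    return {
        str(p): summarize_markdown(p.read_text(encoding="utf-8"))
        for p in sorted(Path(md_dir).rglob("*.md"))
    }


sample = "# Title\n\n![fig](images/a.png)\n\n<table><tr><td>1</td></tr></table>\n\nSome text.\n"
print(summarize_markdown(sample))
```

Comparing these counts between md files that produce nodes and those that do not may show whether the difference lies in the conversion output or further downstream.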

Mar 03 '25 01:03 liyubo-debug

How is the extraction quality after converting the PDF to TXT? Could you share a screenshot of your results?

Mar 06 '25 05:03 zhulin-acad

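When MinerU extracts tables, it saves them in HTML format, which still needs further processing. Some tables cannot be extracted at all, though, and there is nothing to be done about those.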
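
As a rough illustration of that further processing, here is a minimal standard-library sketch that turns simple HTML tables into Markdown pipe tables. It assumes flat tables without rowspan, colspan, or nesting, and the class and function names are made up for this example; real MinerU output may need more care.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables; each table is a list of rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so the cell fits on one row.
            text = " ".join("".join(self._cell).split()).replace("|", "\\|")
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def table_to_markdown(rows):
    """Render rows as a pipe table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    padded = [r + [""] * (width - len(r)) for r in rows]
    lines = ["| " + " | ".join(padded[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in padded[1:]]
    return "\n".join(lines)


def html_tables_to_markdown(html):
    """Return one Markdown table string per <table> found in html."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return [table_to_markdown(t) for t in parser.tables]


sample = ("<table><tr><th>Name</th><th>Role</th></tr>"
          "<tr><td>Sun Wukong</td><td>Disciple</td></tr></table>")
print(html_tables_to_markdown(sample)[0])
```

Tables with merged cells would need a different representation, since pipe tables cannot express spans.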

Mar 06 '25 06:03 yzhbreeze

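KAG's core strength is in the plan/reasoner; it is not good at file parsing. For a production environment, I recommend using a third-party service such as the WPS API for file format conversion. Parsing is time-consuming and laborious, so reinventing the wheel is not advisable.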

Mar 06 '25 06:03 thundax-lyp