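When I test a PDF file, num_nodes, num_edges, and num_subgraphs are all 0. Am I doing something wrong?
The documents in the provided examples are in JSON, CSV, and TXT format. I would like to test a PDF; how should I do that?
https://github.com/NanGePlus/KagTest/tree/main/KagV6Test/XiYouJiTest_KAG_V6 This example includes a PDF file.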
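Thanks for the pointer!
I tested with the example at the address above and found that the PDF yields "graph_stat": {"num_nodes": 0, "num_edges": 0, "num_subgraphs": 0}, whereas the docx and md files produce nonzero values for these fields. Why is it 0 for the PDF?
In addition, when I uploaded my own PDF for testing, I got the following errors: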
INFO:kag.interface.common.llm_client:Error 'name' during invocation:
Traceback (most recent call last):
  File "/home/xxx/project/KAG/kag/interface/common/llm_client.py", line 110, in invoke
    result = prompt_op.parse_response(response, model=self.model, **variables)
  File "/home/xxx/project/KAG/kag/builder/prompt/default/std.py", line 134, in parse_response
    entities_with_offical_name.add(entity["name"])
**KeyError: 'name'**

Traceback (most recent call last):
  File "/home/xxx/project/KAG/kag/builder/runner.py", line 146, in process
    result = self.chain.invoke(
  File "/home/xxx/project/KAG/kag/builder/default_chain.py", line 190, in invoke
    ret = inner_future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/xxx/project/KAG/kag/builder/default_chain.py", line 157, in run_extract
    flow_data = execute_node(node, flow_data, key=input_key)
  File "/home/xxx/project/KAG/kag/builder/default_chain.py", line 143, in execute_node
    node_output.extend(node.invoke(item, **kwargs))
  File "/home/xxx/project/KAG/kag/builder/component/writer/kg_writer.py", line 120, in invoke
    input = self.standarlize_graph(input)
  File "/home/xxx/project/KAG/kag/builder/component/writer/kg_writer.py", line 91, in standarlize_graph
    node.properties[k] = json.dumps(v, ensure_ascii=False)
  File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
**TypeError: Object of type Chunk is not JSON serializable**
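Judging by the error, an object of type Chunk cannot be JSON-serialized. My PDF contains images and tables. Is that unsupported, or is there some other cause? Any pointers would be appreciated!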
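For anyone trying to narrow this down, both exceptions can be reproduced outside KAG. The sketch below is only an illustration under stated assumptions: the `Chunk` dataclass is a hypothetical stand-in (not KAG's real class), and `to_jsonable` and `official_names` are made-up helper names. It shows that `json.dumps` raises this `TypeError` for any custom object unless a `default=` hook is supplied, and that a `KeyError: 'name'` appears whenever a returned entity record lacks a `name` field.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for illustration; NOT KAG's real Chunk class."""
    id: str
    content: str


def to_jsonable(obj):
    """Fallback hook for json.dumps: convert dataclass instances to dicts."""
    if is_dataclass(obj) and not isinstance(obj, type):
        return asdict(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def official_names(entities):
    """Collect entity names, skipping records that have no 'name' key."""
    return {e["name"] for e in entities if isinstance(e, dict) and "name" in e}


props = {"source": Chunk(id="c1", content="Table 1")}

# Without a default= hook, json.dumps cannot encode the custom object.
try:
    json.dumps(props, ensure_ascii=False)
    failed = False
    message = ""
except TypeError as exc:
    failed = True
    message = str(exc)

# With the hook, the same value serializes as a nested object.
serialized = json.dumps(props, ensure_ascii=False, default=to_jsonable)

# A record without 'name' is skipped instead of raising KeyError.
names = official_names([{"name": "孙悟空"}, {"category": "Person"}])
```

This only demonstrates the failure mode; whether the right fix is a serialization hook or keeping the Chunk object out of the node properties in the first place is a question for the KAG maintainers.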
You can use MinerU or marker to convert the PDF to Markdown, and then load the .md file into KAG.
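I'm hitting the same problem: when reading a PDF file, nodes and edges are both 0.
Parsing TXT works for me, but with a large PDF either vectorization or chunking is so slow that it never finishes.
Yes, parsing the PDF was so slow that it never succeeded. I converted the PDF to md; some md files give nonzero nodes and edges, while others are still 0. Converting further to TXT does parse, but the TXT version loses the images and tables.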
How good is the extraction after converting the PDF to TXT? Could you share a screenshot of your results?
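When MinerU extracts tables, it saves them as HTML, which still needs further processing. Some tables cannot be extracted at all, and there is nothing to be done about those.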
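As a rough idea of what that further processing could look like, here is a minimal sketch that turns a simple HTML table into a Markdown table using only the Python standard library. It assumes a flat table with no nesting and ignores rowspan/colspan; `TableParser` and `html_table_to_markdown` are names made up for this example, and the exact HTML that MinerU emits may need extra handling.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of a simple (non-nested) HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _flush_cell(self):
        # Store the current cell with whitespace collapsed to single spaces.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()


def html_table_to_markdown(html):
    """Render the rows as a Markdown table, treating the first row as the header."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    rows = [r for r in parser.rows if r]
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]  # pad short rows

    def fmt(cells):
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(rows[0]), fmt(["---"] * width)]
    lines += [fmt(r) for r in rows[1:]]
    return "\n".join(lines)


sample = (
    "<table><tr><th>Name</th><th>Role</th></tr>"
    "<tr><td>孙悟空</td><td>徒弟</td></tr></table>"
)
markdown = html_table_to_markdown(sample)
```

Tables with merged cells would need a real HTML library or manual cleanup; this sketch simply pads short rows with empty cells.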
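KAG's core strength is in plan/reasoner; it is not good at file parsing. For a production environment, I recommend using a third-party service such as the WPS API for file format conversion. Parsing is time-consuming and laborious, so reinventing the wheel is not recommended.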