
[Bug] [Indexing] Keep reporting JSONDecodeError

Open Dormiveglia-elf opened this issue 11 months ago • 11 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Operating system information

Linux

What happened

ERROR:kag.interface.common.llm_client:Error Expecting value: line 1 column 1 (char 0) during invocation: Traceback (most recent call last):
  File "/KAG/kag/interface/common/llm_client.py", line 107, in invoke
    result = prompt_op.parse_response(response, model=self.model, **variables)
  File "/KAG/kag/builder/prompt/default/triple.py", line 189, in parse_response
    rsp = json.loads(rsp)
  File "//miniconda3/envs/kag/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/miniconda3/envs/kag/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/miniconda3/envs/kag/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I notice that txt files are supposed to have no format restrictions, but indexing still reports this error.

How to reproduce

  1. scanner -> file_scanner
  2. reader: type: txt
  3. python indexer.py

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Dormiveglia-elf avatar Jan 15 '25 09:01 Dormiveglia-elf

LLM: DeepSeek
Embedding model: text-embedding-3-small

Dormiveglia-elf avatar Jan 15 '25 14:01 Dormiveglia-elf

@xionghuaidong Please have a look. I tried re-formatting the txt files but it still fails, reporting invalid JSON.

Dormiveglia-elf avatar Jan 15 '25 15:01 Dormiveglia-elf

@caszkgui Could you please have a look? I tried re-formatting the txt files but it still fails, reporting invalid JSON. Additionally, I sometimes get an embedding error because the input length exceeds 8192 tokens (the maximum accepted by OpenAI's text-embedding-3-small).

Dormiveglia-elf avatar Jan 16 '25 04:01 Dormiveglia-elf
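For the 8192-token limit, a minimal pre-truncation sketch, assuming the cl100k_base tokenizer used by OpenAI's embedding models (the truncate_for_embedding helper is hypothetical, not part of KAG):

    import tiktoken

    EMBEDDING_TOKEN_LIMIT = 8192  # max input tokens for text-embedding-3-small

    def truncate_for_embedding(text: str) -> str:
        # Hypothetical helper: clip text to the embedding model's token limit
        # before sending it to the embedding API.
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        if len(tokens) <= EMBEDDING_TOKEN_LIMIT:
            return text
        return enc.decode(tokens[:EMBEDDING_TOKEN_LIMIT])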

The reason is that when extracting triples, the data returned by the LLM is not in valid JSON format. You can try customizing the triple_prompt for your data. Please refer to the documentation: https://openspg.yuque.com/ndx6g9/docs_en/orwiw49glgg6gebx

zhuzhongshu123 avatar Jan 17 '25 07:01 zhuzhongshu123
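For reference, a defensive-parsing sketch of the failing json.loads step (this is not KAG's actual parse_response; the fence-stripping and brace-matching heuristics are assumptions):

    import json
    import re

    def extract_json(rsp: str):
        # Strip markdown code fences such as ```json ... ``` if the LLM added them.
        fenced = re.search(r"```(?:json)?\s*(.*?)```", rsp, re.DOTALL)
        if fenced:
            rsp = fenced.group(1)
        # Fall back to the outermost {...} or [...] span.
        start = min((i for i in (rsp.find("{"), rsp.find("[")) if i != -1), default=-1)
        if start == -1:
            raise ValueError(f"no JSON payload found in response: {rsp[:80]!r}")
        end = max(rsp.rfind("}"), rsp.rfind("]"))
        return json.loads(rsp[start : end + 1])

json.loads will still raise on truncated output, but the error then points at the real cause instead of "Expecting value: line 1 column 1".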

Thank you for your response. Actually, I am trying to use KAG to test some open benchmark datasets. I don't think it is reasonable to require users to do prompt engineering themselves. I also think your team should update the official docs, since they say KAG accepts txt files in any format, which is not the case.

Dormiveglia-elf avatar Jan 17 '25 07:01 Dormiveglia-elf

Have you modified the split_length config of the splitter? The default value in the public dataset examples (2wiki, musique, and hotpotqa) is 100,000 because those datasets are already split. For a single large file, a smaller value such as 1,000 is recommended; see the config sketch below.

zhuzhongshu123 avatar Jan 17 '25 07:01 zhuzhongshu123
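For illustration, the splitter section of kag_config.yaml might look like this (field names follow the public dataset examples; length_splitter and window_length are assumptions that may differ across KAG versions):

    splitter:
      type: length_splitter   # assumed registered name; check your KAG version
      split_length: 1000      # down from the 100,000 used in the dataset examples
      window_length: 100      # assumed overlap between adjacent chunks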

I have already tried that, and it still doesn't work. Also, indexing is extremely slow even though I am using DeepSeek V3 with 8 threads.

Dormiveglia-elf avatar Jan 17 '25 11:01 Dormiveglia-elf

@Dormiveglia-elf @zhuzhongshu123 So does that mean I have to convert the txt text into the standard triple form described in the documentation before building?

yijieLiu1 avatar Jan 20 '25 11:01 yijieLiu1

If running in product mode, how can this problem be solved?

z-x-x136 avatar Feb 18 '25 01:02 z-x-x136

I ran into the same error output in both v0.6 and v0.7. After debugging, I found it is caused by the LLM output being truncated when it exceeds the length limit. I fixed it by modifying the source code to pass the max_tokens parameter when calling the LLM:

[screenshot]

[screenshot]

The code that extracts the JSON string from the response should also handle the truncated case:

[screenshot]

unrealise avatar Apr 25 '25 06:04 unrealise
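A sketch of that idea against a generic OpenAI-compatible client rather than KAG's llm_client (the model name and max_tokens value are placeholders; finish_reason == "length" is how the OpenAI API signals truncation):

    from openai import OpenAI

    client = OpenAI()  # assumes an OpenAI-compatible endpoint, e.g. DeepSeek

    def call_llm(prompt: str, max_tokens: int = 4096) -> str:
        resp = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,  # raise this so long triple lists are not cut off
        )
        choice = resp.choices[0]
        # finish_reason "length" means the output hit max_tokens and was truncated,
        # so the JSON payload is almost certainly invalid; fail fast instead of
        # handing a half-written string to json.loads.
        if choice.finish_reason == "length":
            raise RuntimeError("LLM output truncated; increase max_tokens")
        return choice.message.content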

Developers can adjust the LLM's max_tokens in kag_config.yaml:

[screenshot]

caszkgui avatar Aug 16 '25 00:08 caszkgui
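Since the screenshot is not preserved: presumably a max_tokens entry under the LLM client section, along these lines (the exact keys are an assumption based on the KAG example configs):

    openie_llm:
      type: maas
      api_key: <your key>
      base_url: https://api.deepseek.com
      model: deepseek-chat
      max_tokens: 8192   # raise if triple-extraction output is being truncated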