
[Bug] [Indexing] Keep reporting JSONDecodeError

Open Dormiveglia-elf opened this issue 11 months ago • 11 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Operating system information

Linux

What happened

ERROR:kag.interface.common.llm_client:Error Expecting value: line 1 column 1 (char 0) during invocation: Traceback (most recent call last):
  File "/KAG/kag/interface/common/llm_client.py", line 107, in invoke
    result = prompt_op.parse_response(response, model=self.model, **variables)
  File "/KAG/kag/builder/prompt/default/triple.py", line 189, in parse_response
    rsp = json.loads(rsp)
  File "//miniconda3/envs/kag/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/miniconda3/envs/kag/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/miniconda3/envs/kag/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I notice that txt files are supposed to have no format restrictions, but indexing still reports this error.

How to reproduce

  1. scanner -> file_scanner
  2. reader: type: txt
  3. python indexer.py

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Dormiveglia-elf avatar Jan 15 '25 09:01 Dormiveglia-elf

LLM: DeepSeek
Embedding model: text-embedding-3-small

Dormiveglia-elf avatar Jan 15 '25 14:01 Dormiveglia-elf

@xionghuaidong Please have a look. I tried re-formatting the txt files but it still fails, reporting invalid JSON.

Dormiveglia-elf avatar Jan 15 '25 15:01 Dormiveglia-elf

@caszkgui Could you please have a look? I tried re-formatting the txt files but it still fails, reporting invalid JSON. Additionally, I sometimes get an embedding error because the input length exceeds 8192 tokens (the maximum accepted by OpenAI's text-embedding-3-small).

Dormiveglia-elf avatar Jan 16 '25 04:01 Dormiveglia-elf
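For the 8192-token limit, a minimal pre-truncation sketch, assuming the cl100k_base tokenizer used by OpenAI's embedding models (the truncate_for_embedding helper is hypothetical, not part of KAG):

    import tiktoken

    EMBEDDING_TOKEN_LIMIT = 8192  # max input tokens for text-embedding-3-small

    def truncate_for_embedding(text: str) -> str:
        # Hypothetical helper: clip text to the embedding model's token limit
        # before sending it to the embedding API.
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        if len(tokens) <= EMBEDDING_TOKEN_LIMIT:
            return text
        return enc.decode(tokens[:EMBEDDING_TOKEN_LIMIT])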

The reason is that when extracting triples, the data returned by the LLM is not in valid JSON format. You can try customizing the triple_prompt for your data. Please refer to the documentation: https://openspg.yuque.com/ndx6g9/docs_en/orwiw49glgg6gebx

zhuzhongshu123 avatar Jan 17 '25 07:01 zhuzhongshu123
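For reference, a defensive-parsing sketch of the failing json.loads step (this is not KAG's actual parse_response; the fence-stripping and brace-matching heuristics are assumptions):

    import json
    import re

    def extract_json(rsp: str):
        # Strip markdown code fences such as ```json ... ``` if the LLM added them.
        fenced = re.search(r"```(?:json)?\s*(.*?)```", rsp, re.DOTALL)
        if fenced:
            rsp = fenced.group(1)
        # Fall back to the outermost {...} or [...] span.
        start = min((i for i in (rsp.find("{"), rsp.find("[")) if i != -1), default=-1)
        if start == -1:
            raise ValueError(f"no JSON payload found in response: {rsp[:80]!r}")
        end = max(rsp.rfind("}"), rsp.rfind("]"))
        return json.loads(rsp[start : end + 1])

json.loads will still raise on truncated output, but the error then points at the real cause instead of "Expecting value: line 1 column 1".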

Thank you for your response. Actually, I am trying to use KAG to test some open benchmark datasets. I don't think it is reasonable to require users to do prompt engineering themselves. I also think your team should update the official docs, since they say KAG accepts txt files in any format, which is not the case.

Dormiveglia-elf avatar Jan 17 '25 07:01 Dormiveglia-elf

Have you modified the split_length config of the splitter? The default value in the public dataset examples (2wiki, musique, and hotpotqa) is 100,000 because those datasets are already split. For a single large file, a smaller value such as 1,000 is recommended; see the config sketch below.

zhuzhongshu123 avatar Jan 17 '25 07:01 zhuzhongshu123
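For illustration, the splitter section of kag_config.yaml might look like this (field names follow the public dataset examples; length_splitter and window_length are assumptions that may differ across KAG versions):

    splitter:
      type: length_splitter   # assumed registered name; check your KAG version
      split_length: 1000      # down from the 100,000 used in the dataset examples
      window_length: 100      # assumed overlap between adjacent chunks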

I have already tried that, and it still doesn't work. Also, indexing is extremely slow even though I am using DeepSeek V3 with 8 threads.

Dormiveglia-elf avatar Jan 17 '25 11:01 Dormiveglia-elf

@Dormiveglia-elf @zhuzhongshu123 So does that mean I have to convert the txt text into the standard triple form described in the documentation before building?

yijieLiu1 avatar Jan 20 '25 11:01 yijieLiu1

If running in product mode, how can this problem be solved?

z-x-x136 avatar Feb 18 '25 01:02 z-x-x136

I ran into the same error output in both v0.6 and v0.7. After debugging, I found it is caused by the LLM output being truncated when it exceeds the length limit. I fixed it by modifying the source code to pass the max_tokens parameter when calling the LLM:

[screenshot]

[screenshot]

The code that extracts the JSON string from the response should also handle the truncated case:

[screenshot]

unrealise avatar Apr 25 '25 06:04 unrealise
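A sketch of that idea against a generic OpenAI-compatible client rather than KAG's llm_client (the model name and max_tokens value are placeholders; finish_reason == "length" is how the OpenAI API signals truncation):

    from openai import OpenAI

    client = OpenAI()  # assumes an OpenAI-compatible endpoint, e.g. DeepSeek

    def call_llm(prompt: str, max_tokens: int = 4096) -> str:
        resp = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,  # raise this so long triple lists are not cut off
        )
        choice = resp.choices[0]
        # finish_reason "length" means the output hit max_tokens and was truncated,
        # so the JSON payload is almost certainly invalid; fail fast instead of
        # handing a half-written string to json.loads.
        if choice.finish_reason == "length":
            raise RuntimeError("LLM output truncated; increase max_tokens")
        return choice.message.content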

Developers can adjust the LLM's max_tokens in kag_config.yaml:

[screenshot]

caszkgui avatar Aug 16 '25 00:08 caszkgui
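Since the screenshot is not preserved: presumably a max_tokens entry under the LLM client section, along these lines (the exact keys are an assumption based on the KAG example configs):

    openie_llm:
      type: maas
      api_key: <your key>
      base_url: https://api.deepseek.com
      model: deepseek-chat
      max_tokens: 8192   # raise if triple-extraction output is being truncated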