[Bug] [Indexing] Keep reporting JSONDecodeError
Search before asking
- [X] I had searched in the issues and found no similar issues.
Operating system information
Linux
What happened
```
ERROR:kag.interface.common.llm_client:Error Expecting value: line 1 column 1 (char 0) during invocation: Traceback (most recent call last):
  File "/KAG/kag/interface/common/llm_client.py", line 107, in invoke
    result = prompt_op.parse_response(response, model=self.model, **variables)
  File "/KAG/kag/builder/prompt/default/triple.py", line 189, in parse_response
    rsp = json.loads(rsp)
  File "//miniconda3/envs/kag/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/miniconda3/envs/kag/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/miniconda3/envs/kag/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
I noticed that txt files have no format restrictions, yet this error is still reported when indexing.
How to reproduce
- scanner -> file_scanner
- reader: type: txt
- python indexer.py
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
LLM: DeepSeek
Embedding Model: text-embedding-3-small
@xionghuaidong Please have a look. I tried re-formatting the txt files, but it still failed, reporting invalid JSON.
@caszkgui Could you please have a look? I tried re-formatting the txt files, but it still failed, reporting invalid JSON. Additionally, I sometimes get an embedding error because the input length exceeds 8192 tokens (the maximum accepted by OpenAI's text-embedding-3-small).
The reason is that when extracting triples, the data returned by LLM is not in the correct JSON format. You can try customizing the triple_prompt based on your data. Please refer to the documentation: https://openspg.yuque.com/ndx6g9/docs_en/orwiw49glgg6gebx
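For illustration, a rough sketch of what such a customization might look like; the base-class import, the class name of the default prompt, and the registration key are all assumptions drawn from the traceback above and KAG examples, so treat the linked doc as authoritative:

```python
from kag.interface import PromptABC
# Class name assumed from kag/builder/prompt/default/triple.py in the traceback.
from kag.builder.prompt.default.triple import OpenIETriplePrompt


@PromptABC.register("my_triple")  # registration key is hypothetical
class MyTriplePrompt(OpenIETriplePrompt):
    def parse_response(self, response: str, **kwargs):
        # Strip markdown fences the LLM sometimes wraps around its JSON
        # before delegating to the default parser.
        cleaned = response.strip().removeprefix("```json").removesuffix("```").strip()
        return super().parse_response(cleaned, **kwargs)
```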
Thank you for your response. Actually, I am trying to use KAG to test some open benchmark datasets. I don't think it is reasonable to require users to do prompt engineering themselves. Also, your team should update the official docs, since they say KAG accepts txt files in any format, which is not the case.
Have you modified the split_length config of the splitter? The default value in the public dataset examples (2wiki, musique and hotpotqa) is 100,000 because those datasets are already split. For a single large file, a smaller value such as 1,000 is recommended.
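For reference, a sketch of the splitter section; the key names follow the public dataset examples and may differ in your KAG version, so verify against your own kag_config.yaml:

```yaml
splitter:
  type: length_splitter   # type name assumed from the public examples
  split_length: 1000      # smaller chunks for a single large file
  window_length: 100      # overlap between adjacent chunks
```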
I have definitely tried this; it still doesn't work, and indexing is extremely slow even though I use DeepSeek V3 with 8 threads.
@Dormiveglia-elf @zhuzhongshu123 So does that mean I have to take the txt text and convert it into the standard triple form described in the documentation in order to build it?
If it is product mode, how can this problem be solved?
I hit the same exception output in both 0.6 and 0.7. Debugging showed it is caused by the LLM output exceeding the length limit and being truncated. I fixed it by modifying the source code to pass the max_tokens parameter when calling the LLM:
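The original patch isn't shown in the thread; as a minimal sketch of the idea, assuming an OpenAI-compatible client (DeepSeek exposes one) with placeholder key/model values:

```python
# Sketch only, not KAG's actual call site: raise max_tokens so long
# triple-extraction outputs are not cut off mid-JSON.
from openai import OpenAI

client = OpenAI(api_key="your-api-key", base_url="https://api.deepseek.com")

prompt = "Extract (subject, predicate, object) triples as a JSON array from: ..."
response = client.chat.completions.create(
    model="deepseek-chat",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
    max_tokens=8192,        # larger completion budget to avoid truncation
)
print(response.choices[0].message.content)
```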
The truncated case should also be handled when extracting the JSON string from the response:
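That snippet isn't included either; a hypothetical best-effort parser illustrating the recovery idea (not KAG's actual parse_response):

```python
import json


def extract_json_array(rsp: str):
    """Best-effort extraction of a JSON array from an LLM response.

    Hypothetical helper: if the output was truncated mid-array, drop the
    partial trailing element and close the array before parsing.
    """
    start = rsp.find("[")
    if start == -1:
        raise ValueError(f"no JSON array in response: {rsp[:80]!r}")
    try:
        return json.loads(rsp[start:])
    except json.JSONDecodeError:
        last = rsp.rfind("}", start)
        if last == -1:
            raise
        # Still fails if the truncation happened inside a nested object;
        # good enough as a fallback for flat triple lists.
        return json.loads(rsp[start : last + 1] + "]")
```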
Developers can adjust the LLM's max_tokens in kag_config.yaml:
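The exact key layout varies across KAG versions; a sketch following the public examples (verify against the config generated for your project):

```yaml
chat_llm:
  type: maas                          # OpenAI-compatible endpoint
  base_url: https://api.deepseek.com
  api_key: your-api-key
  model: deepseek-chat
  max_tokens: 8192                    # raise to avoid truncated JSON output
```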