
Knowledge extraction is failing. How can I fix it?

Open Phoebe246824 opened this issue 2 months ago • 2 comments

Traceback (most recent call last):
  File "/mnt/SSD/home/zxy24/KAG/kag/builder/prompt/default/util.py", line 180, in check_data
    info = json.loads(
  File "/mnt/SSD/home/zxy24/anaconda3/envs/openspg/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/mnt/SSD/home/zxy24/anaconda3/envs/openspg/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/mnt/SSD/home/zxy24/anaconda3/envs/openspg/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 2 (char 1)

2025-10-16 21:34:50 - ERROR - root - Failed to process data {'id': 'Alû', 'name': 'Alû', 'content': 'In Akkadia'}, info:
Traceback (most recent call last):
  File "/mnt/SSD/home/zxy24/KAG/kag/builder/runner.py", line 207, in process
    result = await self.chain.ainvoke(data)
  File "/mnt/SSD/home/zxy24/KAG/kag/interface/builder/builder_chain_abc.py", line 164, in ainvoke
    outputs = await asyncio.gather(*tasks)
  File "/mnt/SSD/home/zxy24/KAG/kag/interface/builder/builder_chain_abc.py", line 134, in execute_node
    results = await asyncio.gather(*tasks)
  File "/mnt/SSD/home/zxy24/KAG/kag/interface/builder/builder_chain_abc.py", line 126, in ainvoke_with_semaphore
    return await node.ainvoke(item)
  File "/mnt/SSD/home/zxy24/KAG/kag/interface/builder/base.py", line 215, in ainvoke
    output = await self._ainvoke(input_data, **kwargs)
  File "/mnt/SSD/home/zxy24/KAG/kag/builder/component/extractor/knowledge_unit_extractor.py", line 671, in _ainvoke
    knowledge_unit_nodes = self.assemble_knowledge_unit(
  File "/mnt/SSD/home/zxy24/KAG/kag/builder/component/extractor/knowledge_unit_extractor.py", line 587, in assemble_knowledge_unit
    for item in knowledge_value.get("core_entities", "").split(","):
AttributeError: 'dict' object has no attribute 'split'

100%|██████████████████████████████████████████████████| 8/8 [02:57<00:00, 22.23s/it]
Done process 8 records, with 0 found in checkpoint, 0 successfully processed and 8 failures encountered.
The log file is located at ckpt/kag_checkpoint_0_1.ckpt. Please access this file to obtain detailed task statistics.
2025-10-16 21:34:50 - INFO - main -

buildKB successfully for /mnt/SSD/home/zxy24/KAG/kag/examples/HotpotQATest/builder/data/sub_corpus.json

Phoebe246824, Oct 16 '25 14:10

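JSONDecodeError("Expecting value", s, err.value) from None: the model may not be reachable, so the call returned no data. It is also possible that the prompt (or something similar) caused the model to return malformed data. Check whether you can connect to the model.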
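To tell these two causes apart, it can help to look at the raw response before parsing it. The sketch below is not KAG code: parse_llm_json is a hypothetical helper that separates an empty response (which points at connectivity or API settings) from a non-empty response that is not valid JSON (which points at the prompt or the model's output format).

```python
import json


def parse_llm_json(raw):
    """Classify a raw LLM response before trusting it as JSON.

    Hypothetical helper, not part of KAG. Returns (status, value),
    where status is "empty", "invalid", or "ok".
    """
    if raw is None or not raw.strip():
        # Nothing came back: check the endpoint, API key, and model name.
        return "empty", None
    try:
        return "ok", json.loads(raw)
    except json.JSONDecodeError as exc:
        # Something came back, but it is not JSON: inspect the prompt
        # and the first characters of the output.
        return "invalid", f"{exc.msg} at char {exc.pos}: {raw[:80]!r}"


print(parse_llm_json(""))
print(parse_llm_json("[Alû is a demon"))
print(parse_llm_json('{"core_entities": "A, B"}'))
```

Logging the first characters of an "invalid" response usually shows quickly whether the model wrapped its answer in prose, markdown, or some other non-JSON text.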

Kaedeser, Oct 26 '25 08:10

🎯 Solution Implemented

I've analyzed and fixed this issue. The problem was that the core_entities field returned by the LLM can come in two different formats:

  1. String format (typically Chinese): "核心实体": "火电发电量,同比增长率,2019年"
  2. Dict format (typically English): "Core Entities": {"T.I.": "Person", "No Mediocre": "Culture and Entertainment"}

The code was only handling the string format and trying to call .split(",") on the value, which caused the AttributeError: 'dict' object has no attribute 'split'.

Fix Applied

Modified kag/builder/component/extractor/knowledge_unit_extractor.py to handle both formats gracefully with proper type checking and error logging.
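As a rough illustration of what handling both formats can look like, here is a minimal sketch. It is not the code from the PR; normalize_core_entities is a hypothetical name, and the list and None branches are assumptions added for robustness.

```python
def normalize_core_entities(value):
    """Return a list of entity names from a str, dict, list, or None.

    Illustrative sketch only; the actual change in the PR may differ.
    """
    if isinstance(value, dict):
        # Dict format: keys are entity names, values are entity types.
        names = list(value.keys())
    elif isinstance(value, str):
        # String format: comma-separated entity names.
        names = value.split(",")
    elif isinstance(value, (list, tuple)):
        # Assumption: some models may return a plain JSON array.
        names = list(value)
    else:
        # None or any unexpected type: treat as "no core entities".
        names = []
    # Drop surrounding whitespace and empty entries.
    return [str(n).strip() for n in names if str(n).strip()]


print(normalize_core_entities("火电发电量,同比增长率,2019年"))
print(normalize_core_entities({"T.I.": "Person", "No Mediocre": "Culture and Entertainment"}))
```

Normalizing to a list up front means the calling loop no longer depends on the value having a .split method.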

Pull Request

The fix has been submitted in PR #717. It includes:

  • ✅ Type-safe handling of both dict and string formats
  • ✅ Comprehensive unit tests
  • ✅ Experiment scripts demonstrating the fix
  • ✅ All code quality checks passing (flake8, black)

The PR is ready for review: https://github.com/OpenSPG/KAG/pull/717

unidel2035, Nov 01 '25 16:11

I found that the model's output structure does not match the format KAG expects. The following approach also fixes it.

Modify kag_config.yaml:

extractor:
  type: knowledge_unit_extractor
  llm: *openie_llm

  ner_prompt:
    type: knowledge_unit_ner
    prompt: |
      You are an information extraction assistant.
      Extract NER results strictly as a dictionary of strings.
      All fields MUST be strings, not lists or objects.

      Output JSON format:
      {
        "entities": "EntityA, EntityB",
        "core_entities": "Entity1, Entity2"
      }

      Text:
      {text}

  triple_prompt:
    type: knowledge_unit_triple
    prompt: |
      Extract triples strictly as comma-separated strings.

      REQUIREMENTS:
      - ALL fields MUST be strings.
      - NEVER output arrays or objects.

      EXAMPLE CORRECT:
      {
        "entities": "A, B, C",
        "relations": "r1, r2",
        "core_entities": "X, Y",
        "summary": "..."
      }

      Text:
      {text}

  kn_prompt:
    type: knowledge_unit
    prompt: |
      You are an information extraction model. 
      Extract the knowledge unit from the text.

      REQUIREMENTS:
      - All fields MUST be strings.
      - DO NOT return arrays or objects.
      - DO NOT return dictionaries.
      - If multiple items, join them with commas.
      - core_entities MUST be a comma-separated STRING, not a JSON object.

      STRICT OUTPUT FORMAT (copy exactly):
      {
        "core_entities": "Entity1, Entity2",
        "summary": "One sentence summary",
        "entities": "EntityA, EntityB",
        "relations": "Relation1, Relation2"
      }

      EXAMPLE OF WRONG OUTPUT (NEVER DO THIS):
      {
        "core_entities": {"A": "Type1", "B": "Type2"}   <-- wrong
      }

      EXAMPLE OF CORRECT OUTPUT:
      {
        "core_entities": "A, B"
      }

      Text:
      {text}
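Prompt instructions like these reduce format drift but cannot guarantee it, so it may be worth checking the model's output against the same "all fields are strings" rule. The sketch below assumes the output is a single JSON object; non_string_fields is a hypothetical helper, not a KAG function.

```python
import json


def non_string_fields(raw_output):
    """Return {field: type_name} for every top-level field that is not a string.

    Hypothetical helper, not part of KAG. An empty dict means the output
    satisfies the "all fields MUST be strings" rule from the prompts.
    """
    data = json.loads(raw_output)
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    return {
        key: type(value).__name__
        for key, value in data.items()
        if not isinstance(value, str)
    }


bad = '{"core_entities": {"A": "Type1", "B": "Type2"}, "summary": "s"}'
good = '{"core_entities": "A, B", "summary": "s"}'
print(non_string_fields(bad))
print(non_string_fields(good))
```

A non-empty result can be logged or used to trigger a retry before the value reaches the extractor.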

Update knowledge_unit_extractor.py:

vi /root/KAG/kag/builder/component/extractor/knowledge_unit_extractor.py

def assemble_knowledge_unit(...):
    knowledge_unit_nodes = []
    knowledge_units = dict(input_knowledge_units)

    # --- BEGIN: Fix Qwen output core_entities being dict ---
    # If the model returned core_entities as a dict of {name: type},
    # collapse it to the comma-separated string the rest of the method expects.
    for k, v in knowledge_units.items():
        core = v.get("core_entities")
        if isinstance(core, dict):
            v["core_entities"] = ", ".join(core.keys())
    # --- END: Fix Qwen output core_entities being dict ---

    def triple_to_knowledge_unit(triple):
        ...

zhangwh807, Dec 19 '25 03:12