fix(builder): handle both dict and string formats for core_entities in knowledge unit extractor

Open unidel2035 opened this issue 1 month ago • 1 comment

🎯 Summary

This PR fixes the AttributeError: 'dict' object has no attribute 'split' error that occurs during knowledge extraction, as reported in issue #714.

๐Ÿ› Problem

The knowledge unit extractor failed whenever the LLM returned the core_entities field as a dictionary, because the code expected a string and called .split(",") on the value.

Root Cause

The LLM can return core_entities in two different formats depending on the language and prompt:

  1. String format (Chinese example): "核心实体": "火电发电量,同比增长率,2019年" (in English: "core entities": "thermal power generation, year-on-year growth rate, 2019")
  2. Dict format (English example): "Core Entities": {"T.I.": "Person", "No Mediocre": "Culture and Entertainment"}

The code at kag/builder/component/extractor/knowledge_unit_extractor.py:587 only handled the string format:

for item in knowledge_value.get("core_entities", "").split(","):
    # This fails when core_entities is a dict!

Error Stack Trace

AttributeError: 'dict' object has no attribute 'split'
  File "/kag/builder/component/extractor/knowledge_unit_extractor.py", line 587, in assemble_knowledge_unit
    for item in knowledge_value.get("core_entities", "").split(","):
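
The failure does not depend on KAG itself. As a minimal sketch, the snippet below feeds the dict-shaped English example into the same .get(...).split(",") chain. The knowledge_value literal and the legacy_split name are illustrative assumptions, not code from the repository.

```python
# Hypothetical LLM output in the dict shape shown in the English example.
knowledge_value = {
    "core_entities": {"T.I.": "Person", "No Mediocre": "Culture and Entertainment"}
}


def legacy_split(value):
    """Mirror the pre-fix expression, which assumes core_entities is a string."""
    # The "" default only applies when the key is missing. A dict that is
    # present is returned unchanged, and dicts have no .split method.
    return value.get("core_entities", "").split(",")


try:
    legacy_split(knowledge_value)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The default argument of dict.get is easy to misread as a type guarantee; it only covers the missing-key case, which is why the string-only code worked for some prompts and failed for others.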

✅ Solution

Modified the assemble_knowledge_unit method in knowledge_unit_extractor.py to handle both formats gracefully:

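# Start from an empty mapping so every branch below leaves core_entities defined
core_entities = {}
core_entities_raw = knowledge_value.get("core_entities", "")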

# Handle both string and dict formats for core_entities
if isinstance(core_entities_raw, dict):
    # Dict format: {entity_name: entity_type}
    core_entities = core_entities_raw
elif isinstance(core_entities_raw, str):
    # String format: comma-separated values
    for item in core_entities_raw.split(","):
        if not item.strip():
            continue
        core_entities[item.strip()] = "Others"
else:
    # Handle unexpected types gracefully with logging
    logger.warning(
        f"Unexpected type for core_entities: {type(core_entities_raw)}, "
        f"expected str or dict. Value: {core_entities_raw}"
    )
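
To exercise this branching outside the extractor, the logic can be wrapped in a small function. The sketch below is an illustration only: normalize_core_entities is a hypothetical helper name that does not appear in the PR, and returning an empty dict for unexpected types is an assumption about what the surrounding method does after logging.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def normalize_core_entities(raw: Any) -> Dict[str, str]:
    """Hypothetical helper: turn either LLM format into {entity_name: entity_type}."""
    if isinstance(raw, dict):
        # Dict format already maps names to types. Copying is a choice made
        # for this sketch so the caller's object is never mutated.
        return dict(raw)
    if isinstance(raw, str):
        # String format: comma-separated names, each typed as "Others".
        return {name.strip(): "Others" for name in raw.split(",") if name.strip()}
    # Anything else (None, list, number) is logged and treated as empty.
    logger.warning(
        "Unexpected type for core_entities: %s, expected str or dict. Value: %r",
        type(raw),
        raw,
    )
    return {}


# Usage with the two shapes described in the root-cause section.
print(normalize_core_entities("火电发电量,同比增长率,2019年"))
print(normalize_core_entities({"T.I.": "Person"}))
```

Both calls return a dict keyed by entity name, so code after this point does not need to know which format the LLM produced.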

🧪 Testing

  1. Experiment Scripts: Created test scripts in the experiments/ directory to verify that the fix handles the following scenarios:

    • String format (Chinese)
    • Dict format (English)
    • Empty strings
    • Missing fields
    • Invalid types (with proper logging)
  2. Unit Tests: Added test_knowledge_unit_core_entities.py with comprehensive test coverage for all core_entities formats

  3. Code Quality: All changes pass flake8 validation
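
The five scenarios listed above could be checked along the following lines. This is a hedged sketch, not the contents of test_knowledge_unit_core_entities.py: entities_from is a hypothetical stand-in for the fixed branch, the logger name is invented for the example, and the expectation that invalid types produce an empty dict is an assumption.

```python
import logging
import unittest

logger = logging.getLogger("core_entities_sketch")  # hypothetical logger name


def entities_from(knowledge_value):
    """Stand-in for the fixed branch; not the KAG method itself."""
    raw = knowledge_value.get("core_entities", "")
    if isinstance(raw, dict):
        return dict(raw)
    if isinstance(raw, str):
        return {part.strip(): "Others" for part in raw.split(",") if part.strip()}
    logger.warning("Unexpected type for core_entities: %s", type(raw))
    return {}


class CoreEntitiesFormatTest(unittest.TestCase):
    def test_string_format(self):
        result = entities_from({"core_entities": "火电发电量,同比增长率,2019年"})
        self.assertEqual(
            result,
            {"火电发电量": "Others", "同比增长率": "Others", "2019年": "Others"},
        )

    def test_dict_format(self):
        payload = {"T.I.": "Person", "No Mediocre": "Culture and Entertainment"}
        self.assertEqual(entities_from({"core_entities": payload}), payload)

    def test_empty_string(self):
        self.assertEqual(entities_from({"core_entities": ""}), {})

    def test_missing_field(self):
        self.assertEqual(entities_from({}), {})

    def test_invalid_type_logs_warning(self):
        # assertLogs fails the test if no WARNING is emitted on this logger.
        with self.assertLogs("core_entities_sketch", level="WARNING") as captured:
            result = entities_from({"core_entities": ["T.I."]})
        self.assertEqual(result, {})
        self.assertEqual(len(captured.records), 1)
```

Saved as a module, the cases can be run with python -m unittest. Checking the warning through assertLogs keeps the invalid-type case from passing silently if the logging call is ever removed.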

๐Ÿ“ Changes

  • Modified: kag/builder/component/extractor/knowledge_unit_extractor.py - Added type checking and handling for both dict and string formats
  • Added: tests/unit/builder/component/test_knowledge_unit_core_entities.py - Unit tests for the fix
  • Added: experiments/test_core_entities_handling.py - Experiment script demonstrating the issue and fix
  • Added: experiments/test_fix.py - Verification script for all scenarios

🔗 Related Issues

Fixes #714


🤖 Generated with Claude Code

unidel2035 · Nov 01 '25 16:11

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

📎 Log file uploaded as GitHub Gist (359KB) 🔗 View complete solution draft log


The working session has now ended. Feel free to review the solution draft and add any feedback.

unidel2035 · Nov 01 '25 16:11