fix(builder): handle both dict and string formats for core_entities in knowledge unit extractor

Open • unidel2035 opened this issue 1 month ago • 1 comment

🎯 Summary

This PR fixes the AttributeError: 'dict' object has no attribute 'split' error that occurs during knowledge extraction, as reported in issue #714.

🐛 Problem

The knowledge unit extractor failed whenever the LLM returned the core_entities field as a dictionary, because the code assumed a string and called .split(",") on the value unconditionally.

Root Cause

The LLM can return core_entities in two different formats depending on the language and prompt:

String format (Chinese example): "核心实体": "火电发电量,同比增长率,2019年"
Dict format (English example): "Core Entities": {"T.I.": "Person", "No Mediocre": "Culture and Entertainment"}

The code at kag/builder/component/extractor/knowledge_unit_extractor.py:587 only handled the string format:

for item in knowledge_value.get("core_entities", "").split(","):
    # This fails when core_entities is a dict!

Error Stack Trace

AttributeError: 'dict' object has no attribute 'split'
  File "/kag/builder/component/extractor/knowledge_unit_extractor.py", line 587, in assemble_knowledge_unit
    for item in knowledge_value.get("core_entities", "").split(","):
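The failure can be reproduced outside the extractor. The snippet below is a minimal sketch: `split_entities` is a hypothetical stand-in that only reuses the failing expression, and the two sample payloads mirror the formats quoted above.

```python
# Sample payloads mirroring the two formats the LLM was observed to return.
string_value = {"core_entities": "火电发电量,同比增长率,2019年"}
dict_value = {
    "core_entities": {"T.I.": "Person", "No Mediocre": "Culture and Entertainment"}
}


def split_entities(knowledge_value):
    # Hypothetical stand-in: same expression as the failing line,
    # which assumes the field is always a string.
    return knowledge_value.get("core_entities", "").split(",")


# The string format works as the original code expected.
print(split_entities(string_value))

# The dict format raises AttributeError because dict has no .split().
try:
    split_entities(dict_value)
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

Note that the `""` default only covers a missing key; any non-string value that is present reaches `.split(",")` directly.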

✅ Solution

Modified the assemble_knowledge_unit method in knowledge_unit_extractor.py to handle both formats gracefully:

# Start from an empty mapping so every branch below leaves core_entities defined
# (assumed here; the surrounding method may already initialize it)
core_entities = {}
core_entities_raw = knowledge_value.get("core_entities", "")

# Handle both string and dict formats for core_entities
if isinstance(core_entities_raw, dict):
    # Dict format: {entity_name: entity_type}
    core_entities = core_entities_raw
elif isinstance(core_entities_raw, str):
    # String format: comma-separated names, each given the default type "Others"
    for item in core_entities_raw.split(","):
        if not item.strip():
            continue  # skip empty segments such as "a,,b" or a trailing comma
        core_entities[item.strip()] = "Others"
else:
    # Handle unexpected types gracefully with logging
    logger.warning(
        f"Unexpected type for core_entities: {type(core_entities_raw)}, "
        f"expected str or dict. Value: {core_entities_raw}"
    )
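The same branching can be expressed as a standalone function, which makes it easier to exercise in isolation. This is only a sketch: `normalize_core_entities` and its `default_type` parameter are hypothetical names, not part of the KAG codebase, and the helper copies the dict input rather than reusing it, which is a deliberate difference from the snippet above.

```python
import logging

logger = logging.getLogger(__name__)


def normalize_core_entities(raw, default_type="Others"):
    """Return a {entity_name: entity_type} dict for str or dict input.

    Hypothetical helper illustrating the type dispatch described above.
    """
    if isinstance(raw, dict):
        # Dict format: already maps names to types; copy to avoid aliasing.
        return dict(raw)
    if isinstance(raw, str):
        # String format: comma-separated names; skip empty segments.
        return {
            name.strip(): default_type
            for name in raw.split(",")
            if name.strip()
        }
    # Any other type: log a warning and fall back to an empty mapping.
    logger.warning(
        "Unexpected type for core_entities: %s, expected str or dict. Value: %r",
        type(raw),
        raw,
    )
    return {}


print(normalize_core_entities("火电发电量,同比增长率,2019年"))
print(normalize_core_entities({"T.I.": "Person"}))
print(normalize_core_entities(None))
```

Returning an empty dict for unexpected types keeps downstream code that iterates over the mapping working, at the cost of silently dropping the entities apart from the warning.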

🧪 Testing

Experiment Scripts: Created test scripts in the experiments/ directory to verify that the fix handles the following scenarios:
- String format (Chinese)
- Dict format (English)
- Empty strings
- Missing fields
- Invalid types (with proper logging)
Unit Tests: Added test_knowledge_unit_core_entities.py with comprehensive test coverage for all core_entities formats
Code Quality: All changes pass flake8 validation
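The scenarios listed above could be covered by tests along the following lines. This is an illustrative sketch, not the contents of test_knowledge_unit_core_entities.py: `parse_core_entities`, the logger name, and the test class are assumptions, and the function is a compact stand-in for the branch logic in assemble_knowledge_unit.

```python
import logging
import unittest

logger = logging.getLogger("core_entities_sketch")


def parse_core_entities(knowledge_value):
    # Hypothetical stand-in for the branch logic in assemble_knowledge_unit.
    raw = knowledge_value.get("core_entities", "")
    if isinstance(raw, dict):
        return dict(raw)
    if isinstance(raw, str):
        return {s.strip(): "Others" for s in raw.split(",") if s.strip()}
    logger.warning("Unexpected type for core_entities: %s", type(raw))
    return {}


class CoreEntitiesFormats(unittest.TestCase):
    def test_string_format(self):
        result = parse_core_entities({"core_entities": "火电发电量,同比增长率,2019年"})
        expected = {"火电发电量": "Others", "同比增长率": "Others", "2019年": "Others"}
        self.assertEqual(result, expected)

    def test_dict_format(self):
        entities = {"T.I.": "Person", "No Mediocre": "Culture and Entertainment"}
        self.assertEqual(parse_core_entities({"core_entities": entities}), entities)

    def test_empty_string_and_missing_field(self):
        self.assertEqual(parse_core_entities({"core_entities": ""}), {})
        self.assertEqual(parse_core_entities({}), {})

    def test_invalid_type_logs_warning(self):
        # assertLogs fails the test if no WARNING is emitted on this logger.
        with self.assertLogs("core_entities_sketch", level="WARNING"):
            self.assertEqual(parse_core_entities({"core_entities": ["a", "b"]}), {})


def run_sketch_tests():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CoreEntitiesFormats)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_sketch_tests()
```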

📝 Changes

Modified: kag/builder/component/extractor/knowledge_unit_extractor.py - Added type checking and handling for both dict and string formats
Added: tests/unit/builder/component/test_knowledge_unit_core_entities.py - Unit tests for the fix
Added: experiments/test_core_entities_handling.py - Experiment script demonstrating the issue and fix
Added: experiments/test_fix.py - Verification script for all scenarios

🔗 Related Issues

Fixes #714

🤖 Generated with Claude Code

Nov 01 '25 16:11 unidel2035

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

📎 Log file uploaded as GitHub Gist (359KB) 🔗 View complete solution draft log

The working session has now ended. Feel free to review the solution draft and leave feedback.

Nov 01 '25 16:11 unidel2035