
[Bug] [kag/builder] TypeError: Object of type Chunk is not JSON serializable

Open francescomagalini opened this issue 11 months ago • 2 comments

Search before asking

  • [x] I had searched in the issues and found no similar issues.

Operating system information

Windows

What happened

In the KGWriter's standarlize_graph method, there is an attempt to JSON-serialize non-string property values. The error occurs because a Chunk object is stored as a property value somewhere in the graph nodes. I resolved the "Object of type Chunk is not JSON serializable" error by adding proper JSON serialization support for Chunk objects:

  • Added a custom ChunkEncoder class that properly serializes Chunk objects using their to_dict() method
  • Modified the dump_chunks function to use this encoder when writing chunks to files
  • Updated the KGWriter to use the ChunkEncoder when serializing node and edge properties

These changes ensure that Chunk objects are properly serialized to JSON throughout the system, maintaining the existing functionality while fixing the serialization error.
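The issue doesn't include the patch itself, but the approach described above can be sketched as follows. This is a minimal, hypothetical version: the Chunk class here is a stand-in for kag.builder's Chunk (which the report says exposes a to_dict() method), and the actual encoder in the fix may differ in detail:

```python
import json


class Chunk:
    """Stand-in for kag.builder's Chunk; the real class also exposes to_dict()."""

    def __init__(self, id, name, content):
        self.id, self.name, self.content = id, name, content

    def to_dict(self):
        return {"id": self.id, "name": self.name, "content": self.content}


class ChunkEncoder(json.JSONEncoder):
    """Fall back to to_dict() for objects the default JSON encoder rejects."""

    def default(self, o):
        if hasattr(o, "to_dict"):
            return o.to_dict()
        return super().default(o)  # raises TypeError for anything else


# A property dict containing Chunk objects now serializes cleanly:
props = {"chunks": [Chunk("abc123", "doc#section", "some text")]}
encoded = json.dumps(props, cls=ChunkEncoder, ensure_ascii=False)
```

Passing cls=ChunkEncoder at each json.dumps call site (dump_chunks, standarlize_graph) is enough; no change to Chunk itself is needed.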

How to reproduce

```python
def build_pdf_knowledge(pdf_path):
    """
    Build knowledge from PDF documents using the unstructured builder chain.
    Uses semantic segmentation and schema constraints for EA concept extraction.

    Args:
        pdf_path: Path to the PDF file to process
    """
    # Initialize LLM client from config
    llm = LLMClient.from_config(KAG_CONFIG.all_config["openie_llm"])

    # Initialize EA-aware NER prompt
    ner_prompt = PromptABC.from_config({
        "type": "default_ner",  # Using our enhanced EA-aware NER
        "language": KAG_PROJECT_CONF.language
    })

    # Initialize components
    reader = PDFReader(
        cut_depth=3,  # Maximum depth for outline-based splitting
        outline_flag=True,  # Enable outline-based document structuring
        llm=llm
    )

    # Split into manageable chunks
    splitter = LengthSplitter(
        split_length=2000,  # Characters per chunk
        window_length=200  # Overlap between chunks
    )

    # Initialize extractor with EA-aware NER and standardization
    extractor = SchemaConstraintExtractor(
        llm=llm,
        ner_prompt=ner_prompt,  # Use EA-aware NER for entity extraction
        std_prompt=PromptABC.from_config({  # Add standardization prompt
            "type": "default_std",
            "language": KAG_PROJECT_CONF.language
        }),
        relation_prompt=None,  # Skip relation extraction
        event_prompt=None  # Skip event extraction
    )

    # Vectorizer
    # vectorizer = BatchVectorizer(vectorize_model=MockVectorizeModel(vector_dimensions=768))

    # Initialize writer
    writer = KGWriter()

    # Create the builder chain
    chain = DefaultUnstructuredBuilderChain(
        reader=reader,
        splitter=splitter,
        extractor=extractor,
        writer=writer
    )

    # Process the PDF
    try:
        chain.invoke(pdf_path)
        logger.info(f"Successfully processed PDF: {pdf_path}")
    except Exception as e:
        logger.error(f"Error processing PDF {pdf_path}: {e}")
        raise


if __name__ == "__main__":
    # Example usage
    dir_path = os.path.dirname(__file__)
    pdf_path = os.path.join(dir_path, "data/erp_rfp.pdf")
    build_pdf_knowledge(pdf_path)
```

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

francescomagalini avatar Jan 19 '25 07:01 francescomagalini

Thank you so much for your assistance! Would you be able to submit a PR to merge your changes into the master branch?

zhuzhongshu123 avatar Jan 20 '25 02:01 zhuzhongshu123

Same issue:

  File "/KAGTest/KAG/kag/builder/component/writer/kg_writer.py", line 137, in _invoke
    input = self.standarlize_graph(input)
  File "/KAGTest/KAG/kag/builder/component/writer/kg_writer.py", line 101, in standarlize_graph
    print(json.dumps(v, ensure_ascii=False))
  File "/opt/anaconda3/envs/kag-demo/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/opt/anaconda3/envs/kag-demo/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/opt/anaconda3/envs/kag-demo/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/opt/anaconda3/envs/kag-demo/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Chunk is not JSON serializable

data:

[<Chunk>: {'id': '48810f04bbcd9ac6d72063f1b0bab04e7192a35c5c8054353f1f1d6785115392', 'name': 'xxxxx#bbbb', 'content': 'xxxxxxxxx ...'}, <Chunk>: {'id': 'eb8344e980bdb6d8daca98f17488394be5e96ba6a0c5b2c00b1889b8cbee1a82', 'name': 'xxxxx#aaaaa', 'content': 'xxxxx ...'}
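The failure itself is easy to reproduce in isolation: json.dumps falls back to JSONEncoder.default, which raises TypeError for any object it doesn't recognize, exactly as the traceback above shows. Here Chunk is just an empty stand-in class, not kag's real one:

```python
import json


class Chunk:  # minimal stand-in for kag.builder's Chunk
    pass


# Mirrors standarlize_graph's json.dumps(v, ...) on a Chunk-valued property
try:
    json.dumps({"v": Chunk()}, ensure_ascii=False)
    msg = None
except TypeError as e:
    msg = str(e)  # "Object of type Chunk is not JSON serializable"
```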

@zhuzhongshu123 Thanks a lot

Like0x avatar May 06 '25 16:05 Like0x

This bug has been fixed in KAG version 0.8.

caszkgui avatar Aug 16 '25 01:08 caszkgui