[Bug] [kag/builder] TypeError: Object of type Chunk is not JSON serializable
### Search before asking
- [x] I had searched in the issues and found no similar issues.
### Operating system information
Windows
### What happened
In the KGWriter's `standarlize_graph` method, non-string property values are serialized with `json.dumps`. The error occurs because a `Chunk` object is being stored as a property value somewhere in the graph nodes. I resolved the "Object of type Chunk is not JSON serializable" error by adding proper JSON serialization support for `Chunk` objects:
- Added a custom `ChunkEncoder` class that serializes `Chunk` objects using their `to_dict()` method
- Modified the `dump_chunks` function to use this encoder when writing chunks to files
- Updated the `KGWriter` to use the `ChunkEncoder` when serializing node and edge properties
These changes ensure that `Chunk` objects are serialized to JSON consistently throughout the system, preserving existing functionality while fixing the serialization error.
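For reference, an encoder of this kind can be a small `json.JSONEncoder` subclass. This is a minimal sketch, not the actual patch; the `Chunk` class below is a hypothetical stand-in with a `to_dict()` method, standing in for kag's real `Chunk` type:

```python
import json


class ChunkEncoder(json.JSONEncoder):
    """Fall back to to_dict() for objects json cannot serialize natively."""

    def default(self, o):
        if hasattr(o, "to_dict"):
            return o.to_dict()
        return super().default(o)


# Hypothetical stand-in for kag's Chunk, for illustration only
class Chunk:
    def __init__(self, id, name, content):
        self.id, self.name, self.content = id, name, content

    def to_dict(self):
        return {"id": self.id, "name": self.name, "content": self.content}


print(json.dumps(Chunk("1", "intro", "text"), cls=ChunkEncoder, ensure_ascii=False))
# → {"id": "1", "name": "intro", "content": "text"}
```

Passing `cls=ChunkEncoder` to every `json.dumps` call site (e.g. in `dump_chunks` and the writer) is enough; no changes to `Chunk` itself are needed beyond the existing `to_dict()`.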
### How to reproduce
```python
import os
import logging

# KAG imports assumed available in the project environment
# (LLMClient, PromptABC, PDFReader, LengthSplitter, SchemaConstraintExtractor,
#  KGWriter, DefaultUnstructuredBuilderChain, KAG_CONFIG, KAG_PROJECT_CONF)

logger = logging.getLogger(__name__)


def build_pdf_knowledge(pdf_path):
    """Build knowledge from PDF documents using the unstructured builder chain.

    Uses semantic segmentation and schema constraints for EA concept extraction.

    Args:
        pdf_path: Path to the PDF file to process
    """
    # Initialize LLM client from config
    llm = LLMClient.from_config(KAG_CONFIG.all_config["openie_llm"])

    # Initialize EA-aware NER prompt
    ner_prompt = PromptABC.from_config({
        "type": "default_ner",  # Using our enhanced EA-aware NER
        "language": KAG_PROJECT_CONF.language,
    })

    # Initialize components
    reader = PDFReader(
        cut_depth=3,        # Maximum depth for outline-based splitting
        outline_flag=True,  # Enable outline-based document structuring
        llm=llm,
    )

    # Split into manageable chunks
    splitter = LengthSplitter(
        split_length=2000,  # Characters per chunk
        window_length=200,  # Overlap between chunks
    )

    # Initialize extractor with EA-aware NER and standardization
    extractor = SchemaConstraintExtractor(
        llm=llm,
        ner_prompt=ner_prompt,  # Use EA-aware NER for entity extraction
        std_prompt=PromptABC.from_config({  # Add standardization prompt
            "type": "default_std",
            "language": KAG_PROJECT_CONF.language,
        }),
        relation_prompt=None,  # Skip relation extraction
        event_prompt=None,     # Skip event extraction
    )

    # Vectorizer
    # vectorizer = BatchVectorizer(vectorize_model=MockVectorizeModel(vector_dimensions=768))

    # Initialize writer
    writer = KGWriter()

    # Create the builder chain
    chain = DefaultUnstructuredBuilderChain(
        reader=reader,
        splitter=splitter,
        extractor=extractor,
        writer=writer,
    )

    # Process the PDF
    try:
        chain.invoke(pdf_path)
        logger.info(f"Successfully processed PDF: {pdf_path}")
    except Exception as e:
        logger.error(f"Error processing PDF {pdf_path}: {e}")
        raise


if __name__ == "__main__":
    # Example usage
    dir_path = os.path.dirname(__file__)
    pdf_path = os.path.join(dir_path, "data/erp_rfp.pdf")
    build_pdf_knowledge(pdf_path)
```
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Thank you so much for your assistance! Would you be able to submit a PR to merge your changes into the master branch?
Same issue:
```
  File "/KAGTest/KAG/kag/builder/component/writer/kg_writer.py", line 137, in _invoke
    input = self.standarlize_graph(input)
  File "/KAGTest/KAG/kag/builder/component/writer/kg_writer.py", line 101, in standarlize_graph
    print(json.dumps(v, ensure_ascii=False))
  File "/opt/anaconda3/envs/kag-demo/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/opt/anaconda3/envs/kag-demo/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/opt/anaconda3/envs/kag-demo/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/opt/anaconda3/envs/kag-demo/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Chunk is not JSON serializable
```
data:

```
[<Chunk>: {'id': '48810f04bbcd9ac6d72063f1b0bab04e7192a35c5c8054353f1f1d6785115392', 'name': 'xxxxx#bbbb', 'content': 'xxxxxxxxx ...'}, <Chunk>: {'id': 'eb8344e980bdb6d8daca98f17488394be5e96ba6a0c5b2c00b1889b8cbee1a82', 'name': 'xxxxx#aaaaa', 'content': 'xxxxx ...'}
```
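The root cause is that `json.dumps` delegates any type it does not know to `JSONEncoder.default`, which raises `TypeError` unconditionally. A minimal reproduction, using a hypothetical stand-in for kag's `Chunk` class, fails the same way:

```python
import json


# Stand-in for kag's Chunk type, for illustration only
class Chunk:
    def __init__(self, id, name, content):
        self.id, self.name, self.content = id, name, content


try:
    # Mirrors the failing call in standarlize_graph: a Chunk ends up
    # inside a value passed to json.dumps with no custom encoder
    json.dumps([Chunk("48810f04", "xxxxx#bbbb", "xxxxxxxxx ...")], ensure_ascii=False)
except TypeError as e:
    print(e)  # → Object of type Chunk is not JSON serializable
```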
@zhuzhongshu123 Thanks a lot
This bug has been fixed in KAG version 0.8.