Robustness: Library fails on LLM outputs containing CJK Radicals and malformed JSON
The langextract library currently lacks robustness in handling common "messy" outputs from Large Language Models (LLMs). Specifically, it fails in two ways:
Unicode Normalization Failure: When the LLM returns strings containing CJK compatibility characters or radicals (e.g., ⻬ U+2EEC instead of 齐 U+9F50, or ⺠ U+2EA0 instead of 民 U+6C11), the library does not normalize them. This leads to silent data corruption where the stored entities do not match standard characters, causing issues in downstream applications like graph databases.
JSON Parsing Failure: When the LLM produces a slightly malformed JSON string (e.g., with unescaped quotes), the library crashes with a json.decoder.JSONDecodeError.
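The JSON failure can be reproduced with the standard library alone. The sketch below is not langextract code: `parse_llm_json` is a hypothetical helper, shown only to illustrate one way a caller could degrade gracefully. It tries strict parsing and, on `json.JSONDecodeError`, returns the error position instead of raising. It does not attempt to repair the malformed string.

```python
import json
from typing import Any, Optional, Tuple


def parse_llm_json(raw: str) -> Tuple[Optional[Any], Optional[str]]:
    """Hypothetical helper: parse LLM output without raising on bad JSON.

    Returns (data, None) on success or (None, error_message) on failure.
    """
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # exc.lineno and exc.colno mark where the strict parser gave up.
        return None, f"{exc.msg} (line {exc.lineno}, column {exc.colno})"


# Unescaped inner quotes, as described above, break strict parsing.
bad = '{"text": "He said "hello" to her"}'
print(parse_llm_json(bad))

# The same payload with the inner quotes escaped parses normally.
good = '{"text": "He said \\"hello\\" to her"}'
print(parse_llm_json(good))
```

Returning the error alongside a `None` result lets the caller decide whether to retry the model call, skip the chunk, or log and continue, rather than losing the whole run to one bad response.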
These issues make the library brittle in real-world use cases where LLM outputs are not always perfectly clean.
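For the normalization failure, a minimal sketch of a possible post-processing step follows. As far as I can tell, `unicodedata.normalize("NFKC", ...)` folds Kangxi radicals (U+2F00 to U+2FD5) into unified ideographs but leaves most of the CJK Radicals Supplement block unchanged, including the two characters reported here, so the sketch adds an explicit table on top. `normalize_cjk` and `RADICAL_TO_UNIFIED` are hypothetical names, and the table is deliberately incomplete.

```python
import unicodedata

# Assumption: illustrative table covering only the two radicals reported
# in this issue. A real fix would need a much fuller mapping.
RADICAL_TO_UNIFIED = str.maketrans({
    "\u2eec": "\u9f50",  # ⻬ (CJK radical) -> 齐
    "\u2ea0": "\u6c11",  # ⺠ (CJK radical) -> 民
})


def normalize_cjk(text: str) -> str:
    """Hypothetical helper: apply NFKC, then the explicit radical table."""
    return unicodedata.normalize("NFKC", text).translate(RADICAL_TO_UNIFIED)


# Radical forms are mapped to the standard ideographs.
print(normalize_cjk("\u2eec\u2ea0"))
# Text that already uses standard ideographs passes through unchanged.
print(normalize_cjk("\u9f50\u6c11"))
```

Applying NFKC first keeps the explicit table small, since it only has to cover characters that have no compatibility decomposition. Note that NFKC also folds other compatibility forms (for example full-width Latin letters), which may or may not be wanted for extracted entities.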
similar issue
    result = lx.extract(
             ^^^^^^^^^^^
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/__init__.py", line 291, in extract
    return annotator.annotate_text(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/annotation.py", line 506, in annotate_text
    annotations = list(
                  ^^^^^
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/annotation.py", line 236, in annotate_documents
    yield from self._annotate_documents_single_pass(
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/annotation.py", line 356, in _annotate_documents_single_pass
    annotated_chunk_extractions = resolver.resolve(
                                  ^^^^^^^^^^^^^^^^^
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/resolver.py", line 235, in resolve
    processed_extractions = self.extract_ordered_extractions(extraction_data)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/resolver.py", line 478, in extract_ordered_extractions
    raise ValueError(
ValueError: Extraction value must be a dict or None for attributes.