
Robustness: Library fails on LLM outputs containing CJK Radicals and malformed JSON

Open • via007 opened this issue 3 months ago • 1 comment

The langextract library currently lacks robustness in handling common "messy" outputs from Large Language Models (LLMs). Specifically, it fails in two ways:

Unicode Normalization Failure: When the LLM returns strings containing CJK compatibility characters or radicals (e.g., ⻬ U+2EEC instead of 齐 U+9F50, or ⺠ U+2EA0 instead of 民 U+6C11), the library does not normalize them. This leads to silent data corruption where the stored entities do not match standard characters, causing issues in downstream applications like graph databases.

JSON Parsing Failure: When the LLM produces a slightly malformed JSON string (e.g., with unescaped quotes), the library crashes with a json.decoder.JSONDecodeError.
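For the Unicode case, here is a minimal sketch of a caller-side workaround, not something langextract provides. `normalize_cjk` and `RADICAL_TO_UNIFIED` are hypothetical names, and the table only covers the two radicals mentioned above. As far as I can tell, NFKC folds Kangxi radicals (U+2F00 to U+2FD5) but leaves most CJK Radicals Supplement code points such as U+2EEC unchanged, so an explicit mapping is applied after it.

```python
import unicodedata

# Hypothetical, illustrative table: only the two radicals named in this issue.
# A real fix would need a much fuller radical-to-ideograph mapping.
RADICAL_TO_UNIFIED = {
    "\u2eec": "\u9f50",  # ⻬ (radical) -> 齐 (unified ideograph)
    "\u2ea0": "\u6c11",  # ⺠ (radical) -> 民 (unified ideograph)
}
_TRANSLATION = str.maketrans(RADICAL_TO_UNIFIED)


def normalize_cjk(text: str) -> str:
    """Apply NFKC, then replace the radicals listed in the table above."""
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(_TRANSLATION)
```

Note that NFKC also rewrites other compatibility characters (fullwidth Latin letters, for example), so the normalized string may no longer match the source text character for character.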

These issues make the library brittle in real-world use cases where LLM outputs are not always perfectly clean.
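For the JSON case, a tolerant parser could return a sentinel instead of raising, so the caller can retry or skip the chunk. The sketch below is an assumption about what such a fallback might look like; `parse_llm_json` is a hypothetical helper, not part of langextract. It does not repair unescaped quotes; it only recovers a JSON object surrounded by extra text and otherwise gives up quietly.

```python
import json
from typing import Any, Optional


def parse_llm_json(raw: str) -> Optional[Any]:
    """Parse LLM output as JSON; return None instead of raising on failure."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # Fallback: try the span from the first "{" to the last "}",
    # which handles replies that wrap the object in extra prose.
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None  # still malformed; let the caller decide what to do
```

A caller would check for `None` and then re-prompt the model or drop that chunk, rather than letting one bad response abort the whole run.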

via007 • Sep 02 '25 08:09

I'm hitting a similar issue:

    result = lx.extract(
             ^^^^^^^^^^^
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/__init__.py", line 291, in extract
    return annotator.annotate_text(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/annotation.py", line 506, in annotate_text
    annotations = list(
                  ^^^^^
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/annotation.py", line 236, in annotate_documents
    yield from self._annotate_documents_single_pass(
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/annotation.py", line 356, in _annotate_documents_single_pass
    annotated_chunk_extractions = resolver.resolve(
                                  ^^^^^^^^^^^^^^^^^
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/resolver.py", line 235, in resolve
    processed_extractions = self.extract_ordered_extractions(extraction_data)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/liangdeo/.virtualenvs/data/lib/python3.12/site-packages/langextract/resolver.py", line 478, in extract_ordered_extractions
    raise ValueError(
ValueError: Extraction value must be a dict or None for attributes.

DeoLeung • Sep 05 '25 13:09