
json.decoder.JSONDecodeError: Unterminated string starting at

wisteriesDev opened this issue 4 months ago • 4 comments

Hello.

I get JSON parse errors from resolver.py, probably due to malformed structured output from the LLM.

code:

import langextract as lx  # standard import alias per the langextract docs

try:
    result_generator = lx.extract(
        text_or_documents=comment_texts,
        prompt_description=topic_prompt,
        examples=topic_examples,
        model_id="gemini-2.5-flash",
        debug=True,
    )
except Exception as e:
    print(f"Caught a parsing error: {e}")

print(f"Extracted {len(result_generator.str)} entities from {len(result_generator.text):,} characters")
# trying to save the output for debugging
with open("jsondebugtest.jsonl", "w") as f:
    for result in result_generator:
        f.write(result.text)
        f.write("\n")

error trace:

LangExtract: model=gemini-2.5-flash, current=322 chars, processed=322 chars:  [00:00]ERROR:absl:Failed to parse content.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/langextract/resolver.py", line 349, in _extract_and_parse_content
    parsed_data = json.loads(content)
                  ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Unterminated string starting at: line 4 column 16 (char 42)
LangExtract: model=gemini-2.5-flash, current=322 chars, processed=322 chars:  [04:42]Caught a parsing error: Failed to parse content.

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipython-input-1112811029.py in <cell line: 0>()
     19     print(f"Caught a parsing error: {e}")
     20 
---> 21 print(f"Extracted {len(result_generator.str)} entities from {len(result_generator.text):,} characters")
     22 with open("jsondebugtest.jsonl", "w") as f:
     23     for result in result_generator:

NameError: name 'result_generator' is not defined

Is there any way to output the raw LLM output / inference API response before it goes to resolver.py? It's hard to get a proper prompt and proper structured output without an easy way to debug the LLM output.
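
In the meantime, one blunt workaround is to shim json.loads so the payload that fails to parse gets printed just before the decoder raises. This is a throwaway debugging hack, not a langextract feature:

import json

# Keep a reference to the real decoder and wrap it with logging.
_orig_loads = json.loads

def _logging_loads(s, *args, **kwargs):
    try:
        return _orig_loads(s, *args, **kwargs)
    except json.JSONDecodeError:
        # Dump the raw model output that resolver.py was about to parse.
        print("RAW LLM OUTPUT >>>")
        print(s)
        raise

json.loads = _logging_loads  # apply before calling lx.extract(...)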

wisteriesDev avatar Aug 12 '25 17:08 wisteriesDev

What you said is indeed correct. An additional parameter that exposes the raw LLM response mid-pipeline would be helpful for debugging and identifying the root cause (e.g., verifying whether the LLM is actually emitting the JSON format you expect). Note that when deploying a local LLM with Ollama, the model needs to be configured to emit JSON so that langextract's JSON parser can handle it; if the output is not valid JSON, parsing will fail. For some models you can also instruct JSON output in the prompt, but this is not 100% reliable and may still cause occasional parsing failures. A better solution is to switch to an Ollama-supported model that can output JSON natively.
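
For reference, a minimal sketch of Ollama's JSON mode via its REST API, assuming a local Ollama server on the default port and a pulled gemma2 model (the model name is just an example):

import json
import requests

# "format": "json" asks Ollama to constrain the model's output to valid JSON.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma2",
        "prompt": "List two colors as a JSON object.",
        "format": "json",
        "stream": False,
    },
)
print(json.loads(resp.json()["response"]))  # should now parse cleanly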

Smonkey123 avatar Aug 13 '25 02:08 Smonkey123

You can try adding the parameter fence_output=True when calling the extract method. It may help.
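
As I understand it, fence_output controls whether langextract expects the model's answer wrapped in Markdown code fences (```json ... ```) rather than bare JSON. A sketch of the call from the original post with the flag added:

result = lx.extract(
    text_or_documents=comment_texts,
    prompt_description=topic_prompt,
    examples=topic_examples,
    model_id="gemini-2.5-flash",
    fence_output=True,  # expect fenced JSON instead of bare JSON
)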

xinzhuang avatar Aug 13 '25 06:08 xinzhuang

I am using Gemma models locally, which do not support structured output (i.e., they generate raw JSON text rather than schema-constrained output).

To work around this error, I placed further instructions in the prompt, as shown in the example below:

    Return your answer as a JSON object with this format:
    {
        "extractions": [
            {
                "extraction_class": "exclusion",
                "extraction_text": "exact text from the policy document",
                "attributes": {...}
            }
        ]
    }

As a result, I no longer face the same issue.
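
Concretely, wiring that instruction into the call might look like the sketch below. The local/Ollama parameter names follow langextract's Ollama example and may differ across versions; policy_text, extraction_prompt, my_examples, and format_instruction are my own placeholder names:

format_instruction = """
Return your answer as a JSON object with this format:
{"extractions": [{"extraction_class": "...", "extraction_text": "...", "attributes": {}}]}
"""

result = lx.extract(
    text_or_documents=policy_text,
    prompt_description=extraction_prompt + format_instruction,
    examples=my_examples,
    model_id="gemma2",                  # local Gemma served by Ollama
    model_url="http://localhost:11434",
    fence_output=False,                 # raw JSON, no Markdown fences
    use_schema_constraints=False,       # Gemma lacks constrained decoding
)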

kennethleungty avatar Aug 16 '25 15:08 kennethleungty

Hi @wisteriesDev,

This parsing error should be addressed by PR #239, which introduces a centralized FormatHandler for consistent parsing across all providers. The PR includes proper fence detection and fallback mechanisms to handle various output formats, including edge cases like missing or malformed JSON/YAML structures.

If parsing errors persist after the PR is merged, please reopen with specific examples and reproduction steps.

Thank you for reporting this issue and helping improve LangExtract!

aksg87 avatar Sep 12 '25 11:09 aksg87