batch langextract: langextract.resolver.ResolverParsingError: Failed to parse JSON content: Unterminated string starting at
Describe the overall issue and situation
When following the batch tutorial at https://github.com/google/langextract/blob/main/docs/examples/batch_api_example.md, reading the batch predictions fails with:
```
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/langextract/core/format_handler.py", line 176, in parse_output
    parsed = json.loads(content)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 354, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Unterminated string starting at: line 7 column 16 (char 163)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/langextract/resolver.py", line 260, in resolve
    extraction_data = self.format_handler.parse_output(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/core/format_handler.py", line 182, in parse_output
    raise exceptions.FormatParseError(msg) from e
langextract.core.exceptions.FormatParseError: Failed to parse JSON content: Unterminated string starting at: line 7 column 16 (char 163)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ds/etl.py", line 33, in
```
It looks like langextract isn't set up to read JSON Lines files properly, and is instead expecting a single regular JSON document.
Expected Behavior

I would expect the parser to recognize that it is reading a JSONL file rather than regular JSON, and to read the predictions line by line.
Actual Behavior

The parser reads the JSONL file as a single regular JSON document and fails with the errors above.
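For illustration, here is a minimal standalone demonstration of the difference (using only the standard-library `json` module, not langextract itself): a JSONL payload fails when passed to a single `json.loads` call, but parses cleanly line by line.

```python
import json

# Two records in JSON Lines form: each line is its own JSON document.
jsonl_text = '{"id": 1}\n{"id": 2}\n'

# Parsing the whole payload as one JSON document fails as soon as the
# second line begins (the decoder reports "Extra data").
try:
    json.loads(jsonl_text)
except json.JSONDecodeError as e:
    print(f"whole-file parse failed: {e.msg}")

# Parsing line by line succeeds.
records = [json.loads(line) for line in jsonl_text.splitlines()]
print(records)  # [{'id': 1}, {'id': 2}]
```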
Steps to Reproduce the Issue
After pre-processing the examples, the prompt, and the incoming inference data, I run the following code with langextract 1.1.0:
```python
# Configure batch settings
batch_config = {
    "enabled": True,
    "threshold": 10,
    "poll_interval": 30,
    "timeout": 3600,
    "enable_caching": True,
    "retention_days": 30,
}

# Running Extraction
results = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt_examples.get('prompt'),
    examples=examples,
    model_id=prompt_examples.get('model'),
    batch_length=1000,
    language_model_params={
        "vertexai": True,
        "project": "your-project-here",
        "location": "us-central1",
        "batch": batch_config,
    },
)
```
The batch job is created correctly, runs to completion, and the predictions file is written, but langextract then fails while reading it.
Proposed Solution
You can take the JSONL file and correctly pull out the predictions by reading them line by line; this snippet works, for example:
```python
import json

processed_records = []
with open('lang_extract_batch_output.jsonl', 'r') as f:
    for i, line in enumerate(f):
        try:
            record = json.loads(line)
            processed_records.append(record)
        except json.JSONDecodeError:
            print(f'Skipping record {i} due to bad parse')
```
Using this, each line is placed into the `processed_records` list correctly.
This keeps coming up whenever the LLM produces runaway repetitive text while generating its response. A batch might have succeeded on 99.999% of the records, but one bad record causes the whole job to fail with no recovery.
Could langextract emit a soft error for failing rows, return a null extraction set for those records, and continue onwards? Here's an example of what showed up in the predictions.jsonl that causes the whole job to fail. This happens on both Gemini 2.5 Flash and Gemini 2.5 Pro (posting the first 500 characters; the full line is 188,998 characters):
```
thoughtblockquote_t)\n```_content_t)\n```_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_cont
```
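The soft-error behavior being proposed could look roughly like the sketch below. This is only an illustration, not langextract's API: `parse_predictions_leniently` is a hypothetical helper that keeps a `None` placeholder for each malformed line so record indices still line up with the input documents.

```python
import json

def parse_predictions_leniently(path):
    """Parse a batch predictions JSONL file, substituting None for bad rows.

    Hypothetical helper illustrating the proposed behavior: one malformed
    line (e.g. runaway repetitive model output) should not fail the job.
    """
    records, errors = [], []
    with open(path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if not line.strip():
                continue  # ignore blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as e:
                # Soft error: record the failure and keep a null placeholder
                # instead of aborting the whole batch.
                errors.append((i, e.msg))
                records.append(None)
    return records, errors
```

A caller could then log `errors` and treat each `None` entry as an empty extraction set for that record.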
It also appears that adding:

```python
resolver_params={"suppress_parse_errors": True}
```

to the `extract` call has no effect.