
batch langextract: langextract.resolver.ResolverParsingError: Failed to parse JSON content: Unterminated string starting at

Open · eakertFacet opened this issue · 2 comments

Describe the overall issue and situation

When following the batch tutorial at https://github.com/google/langextract/blob/main/docs/examples/batch_api_example.md, the following error occurs when langextract reads the batch predictions:

```
Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/langextract/core/format_handler.py", line 176, in parse_output
    parsed = json.loads(content)
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 338, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/json/decoder.py", line 354, in raw_decode
    obj, end = self.scan_once(s, idx)
               ^^^^^^^^^^^^^^^^^^^^^^
json.decoder.JSONDecodeError: Unterminated string starting at: line 7 column 16 (char 163)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.12/site-packages/langextract/resolver.py", line 260, in resolve
    extraction_data = self.format_handler.parse_output(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/core/format_handler.py", line 182, in parse_output
    raise exceptions.FormatParseError(msg) from e
langextract.core.exceptions.FormatParseError: Failed to parse JSON content: Unterminated string starting at: line 7 column 16 (char 163)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/ds/etl.py", line 33, in <module>
    etl_switch()
  File "/ds/etl.py", line 26, in etl_switch
    numeric_extraction.prepare_extract_intro_call_numbers()
  File "/ds/scripts/numeric_extraction.py", line 221, in prepare_extract_intro_call_numbers
    results = lx.extract(
              ^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/__init__.py", line 55, in extract
    return extract_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/extraction.py", line 358, in extract
    return list(result)
           ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/annotation.py", line 255, in annotate_documents
    yield from self._annotate_documents_single_pass(
  File "/opt/venv/lib/python3.12/site-packages/langextract/annotation.py", line 388, in _annotate_documents_single_pass
    resolved_extractions = resolver.resolve(
                           ^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/langextract/resolver.py", line 271, in resolve
    raise ResolverParsingError(str(e)) from e
langextract.resolver.ResolverParsingError: Failed to parse JSON content: Unterminated string starting at: line 7 column 16 (char 163)
```

It looks like the resolver isn't set up to read JSON Lines (JSONL) files properly; it expects regular JSON.

Expected Behavior

I would expect the parser to recognize that it is reading a JSONL file rather than regular JSON, and to read the predictions line by line.

Actual Behavior

The parser appears to read the JSONL file as a single regular JSON document and fails with the errors above.
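For illustration, here is a minimal sketch, independent of langextract, of that difference: `json.loads` rejects a multi-record JSONL payload outright, while per-line decoding succeeds. (The two sample records are made up.)

```python
import json

# Two records in JSON Lines format: one complete JSON document per line.
jsonl_payload = '{"id": 1, "text": "ok"}\n{"id": 2, "text": "also ok"}\n'

# Parsing the whole payload as a single JSON document fails as soon as
# the decoder reaches the second record.
try:
    json.loads(jsonl_payload)
except json.JSONDecodeError as err:
    print(f"whole-file parse failed: {err}")  # Extra data: line 2 column 1 ...

# Decoding line by line succeeds.
records = [json.loads(line) for line in jsonl_payload.splitlines() if line.strip()]
print(records)  # [{'id': 1, 'text': 'ok'}, {'id': 2, 'text': 'also ok'}]
```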

Steps to Reproduce the Issue

After pre-processing the examples, the prompt, and the incoming inference, I run the following code with langextract 1.1.0:

```python
# Configure batch settings
batch_config = {
    "enabled": True,
    "threshold": 10,
    "poll_interval": 30,
    "timeout": 3600,
    "enable_caching": True,
    "retention_days": 30,
}

# Run the extraction
results = lx.extract(
    text_or_documents=documents,
    prompt_description=prompt_examples.get('prompt'),
    examples=examples,
    model_id=prompt_examples.get('model'),
    batch_length=1000,
    language_model_params={
        "vertexai": True,
        "project": "your-project-here",
        "location": "us-central1",
        "batch": batch_config,
    },
)
```

The batch job is created correctly, runs through, and the predictions file is created, but then langextract fails when reading it.

Proposed Solution

The predictions can be pulled out of the JSONL file correctly by reading it line by line; this snippet works, for example:

```python
import json

processed_records = []
with open('lang_extract_batch_output.jsonl', 'r') as f:
    for i, line in enumerate(f):
        try:
            record = json.loads(line)
            processed_records.append(record)
        except json.JSONDecodeError:
            print(f'Skipping record {i} due to bad parse')
```

Using this, each line was placed into the `processed_records` list correctly.

— eakertFacet, Nov 27 '25

This keeps coming up whenever the LLM decides to go crazy and produce repetitive text in its response. A batch might have worked on 99.999% of the records, but one bad actor causes the whole job to fail with no recovery.

Can we write out a soft error for failing rows, fall back to a null extraction set for that record, and continue onwards? (A sketch of one possible shape for this follows the sample output below.) Here's an example of what showed up in the predictions.jsonl that causes the whole job to fail. This happens with both Gemini 2.5 Flash and Gemini 2.5 Pro (posting the first 500 characters; the full record is 188,998 characters):

````
thoughtblockquote_t)\n```_content_t)\n```_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_content_t)\n{"extra_cont
````
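One possible shape for that soft-error behavior, as a sketch only (the function name and the `{'extractions': []}` null record are hypothetical placeholders, not langextract's actual API or record format): read the predictions file line by line, log each unparseable row, and substitute an empty extraction set so the rest of the batch survives.

```python
import json
import logging

logger = logging.getLogger(__name__)

def parse_predictions_softly(path):
    """Yield one record per JSONL line, replacing unparseable lines
    with a null extraction set instead of aborting the whole batch."""
    with open(path, 'r') as f:
        for i, line in enumerate(f):
            if not line.strip():
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                logger.warning("Soft error on record %d: %s", i, err)
                # Hypothetical null record so downstream counts stay aligned.
                yield {"extractions": []}

records = list(parse_predictions_softly('lang_extract_batch_output.jsonl'))
```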

— eakertFacet, Dec 08 '25

It also appears that adding `resolver_params={"suppress_parse_errors": True}` to the `extract` call does not do anything.
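In the meantime, a possible mitigation, sketched under the assumption that `lx.extract` raises `ResolverParsingError` as in the traceback above (the keyword-argument passthrough mirrors the reproduction snippet, and `extract_with_fallback` is a made-up helper name): run the whole batch first, and on a parse failure retry documents one at a time so a single bad model response costs one record instead of the whole job.

```python
import langextract as lx
from langextract.resolver import ResolverParsingError

def extract_with_fallback(documents, **extract_kwargs):
    """Run the whole batch; on a parse failure, retry each document
    individually and record None for any that still fail.

    Sketch only: assumes lx.extract accepts the same kwargs as in the
    reproduction snippet and raises ResolverParsingError on a bad
    prediction, as shown in the traceback above.
    """
    try:
        return list(lx.extract(text_or_documents=documents, **extract_kwargs))
    except ResolverParsingError:
        results = []
        for doc in documents:
            try:
                results.extend(lx.extract(text_or_documents=[doc], **extract_kwargs))
            except ResolverParsingError:
                results.append(None)  # soft error: null result for this record
        return results
```

Per-document retries would of course be slower and may bypass batching below the configured threshold, but they isolate the bad record.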

— eakertFacet, Dec 08 '25