BUG: Failed to parse content from a large file (ResolverParsingError)
I tried extraction on a relatively large file (27,197 characters) and it failed. Here's how I called the extract function:

```python
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    batch_length=20,
    extraction_passes=3,    # Multiple passes for improved recall
    max_workers=20,         # Parallel processing for speed
    max_char_buffer=1000,   # Smaller contexts for better accuracy
)
```
and the result:

```text
File ".../langextract/__init__.py", line 226, in extract
    return annotator.annotate_text(
        text=text_or_documents,
        ...<5 lines>...
        extraction_passes=extraction_passes,
    )
File ".../langextract/annotation.py", line 505, in annotate_text
    annotations = list(
        self.annotate_documents(
            ...<7 lines>...
        )
    )
File ".../langextract/annotation.py", line 239, in annotate_documents
    yield from self._annotate_documents_sequential_passes(
        ...<7 lines>...
    )
File ".../langextract/annotation.py", line 419, in _annotate_documents_sequential_passes
    for annotated_doc in self._annotate_documents_single_pass(
        document_list,
        ...<4 lines>...
        **kwargs,  # Only show progress on first pass
    ):
File ".../langextract/annotation.py", line 355, in _annotate_documents_single_pass
    annotated_chunk_extractions = resolver.resolve(
        top_inference_result, debug=debug, **kwargs
    )
File ".../langextract/resolver.py", line 230, in resolve
    raise ResolverParsingError("Failed to parse content.") from e
langextract.resolver.ResolverParsingError: Failed to parse content.
```
Hi, with a large file there might be some unicode characters or something else that broke the parser.
I would suggest looking at the logs to identify the input at the point of failure. Also, here is a demo using LX where the input text is sanitized first: https://huggingface.co/spaces/google/radextract/blob/main/sanitize.py. Something like this should be incorporated directly into the library in the future.
If you identify the culprit text, please share it so the resolver can be reviewed against it.
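Until that happens, here is a minimal sketch of that idea. It is not a copy of the linked script (whose rules differ), so treat `sanitize_text` as a hypothetical stand-in; `input_text`, `prompt`, and `examples` are the variables from the report above.

```python
import re
import unicodedata

import langextract as lx

def sanitize_text(text: str) -> str:
    """Hypothetical pre-processing in the spirit of the linked sanitize.py:
    normalize unicode and drop control characters that can confuse the
    resolver's JSON/YAML parsing."""
    text = unicodedata.normalize("NFKC", text)
    # Keep newlines and tabs; drop all other control/format characters.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse whitespace runs left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text)

result = lx.extract(
    text_or_documents=sanitize_text(input_text),
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
```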
I'm receiving the same error on some texts, while on others it works perfectly. Without the sanitize function the parser fails with this message:

```text
LangExtract: Processing, current=8,246 chars, processed=8,246 chars: [00:00]
ERROR:absl:Content does not contain 'extractions' key
```

After applying the function it still fails after some time, with this message:

```text
LangExtract: Processing, current=5,676 chars, processed=23,626 chars: [00:50]
ERROR:absl:Content does not contain 'extractions' key.
```

It seems https://huggingface.co/spaces/google/radextract/blob/main/sanitize.py is not a perfect solution.
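Since the culprit text was requested above, here is a rough bisection helper I put together to isolate a failing span. It is hypothetical and only approximate, since model output is not deterministic, and it reuses `prompt` and `examples` from the surrounding scope:

```python
import langextract as lx
from langextract.resolver import ResolverParsingError

def find_failing_span(text: str, min_len: int = 500) -> str | None:
    """Bisect the input to isolate a span that still raises
    ResolverParsingError, so the culprit text can be shared upstream."""
    try:
        lx.extract(
            text_or_documents=text,
            prompt_description=prompt,
            examples=examples,
            model_id="gemini-2.5-flash",
        )
        return None  # this span parses fine
    except ResolverParsingError:
        if len(text) <= min_len:
            return text  # small enough to report as-is
        mid = len(text) // 2
        return (find_failing_span(text[:mid], min_len)
                or find_failing_span(text[mid:], min_len))
```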
I'm receiving the same error while using models other than Gemini 2.5 Flash or Pro, mostly while processing the second chunk batch: `ERROR:absl:Content does not contain 'extractions' key.`
Using an English-language source text of 16,895 characters, for some prompts I get this exception with both the gemma2:2b and gemma3:27b models running locally under Ollama. With more generic variations of the same prompt on the exact same source text, the library runs through and produces results, albeit overly generic ones.
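For reference, this is roughly how I'm calling it. `source_text` stands in for my document; `model_url` points at the default Ollama endpoint, and the two `False` flags are my attempt to stop the resolver from expecting fenced, schema-constrained output (all of these values are assumptions on my side, though the parameters themselves appear in the extract signature in the traceback below):

```python
import langextract as lx

result = lx.extract(
    text_or_documents=source_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemma2:2b",                # also tried gemma3:27b
    model_url="http://localhost:11434",  # default Ollama endpoint
    fence_output=False,                  # local models rarely emit code fences
    use_schema_constraints=False,        # schema constraints assume Gemini
)
```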
I also get a similar error:
```text
ERROR:absl:Input string does not contain valid markers.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/resolver.py:224, in Resolver.resolve(self, input_text, suppress_parse_errors, **kwargs)
    223 try:
--> 224   extraction_data = self.string_to_extraction_data(input_text)
    225   logging.debug("Parsed content: %s", extraction_data)

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/resolver.py:383, in Resolver.string_to_extraction_data(self, input_string)
    365 """Parses a YAML or JSON-formatted string into extraction data.
    366
    367 This function extracts data from a string containing YAML or JSON content.
   (...)
    381     ValueError: If the input is invalid or does not contain expected format.
    382 """
--> 383 parsed_data = self._extract_and_parse_content(input_string)
    385 if not isinstance(parsed_data, dict):

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/resolver.py:342, in Resolver._extract_and_parse_content(self, input_string)
    341 logging.error("Input string does not contain valid markers.")
--> 342 raise ValueError("Input string does not contain valid markers.")
    344 content = input_string[left + prefix_length : right].strip()

ValueError: Input string does not contain valid markers.

The above exception was the direct cause of the following exception:

ResolverParsingError                      Traceback (most recent call last)
Cell In[37], line 1
----> 1 result = lx.extract(
      2     text_or_documents=sample_input_text, #, [:3000], #input_text[2000:2125], #dummy_example,
      3     model_id="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
      4     prompt_description=prompt,
      5     examples=examples,
      6     #batch_length=20,
      7     #fence_output=False,
      8     #extraction_passes=3,
      9     #max_workers=3,
     10     #max_char_buffer=1000
     11 )

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/__init__.py:55, in extract(*args, **kwargs)
     53 def extract(*args: Any, **kwargs: Any):
     54     """Top-level API: lx.extract(...)."""
---> 55     return extract_func(*args, **kwargs)

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/extraction.py:296, in extract(text_or_documents, prompt_description, examples, model_id, api_key, language_model_type, format_type, max_char_buffer, temperature, fence_output, use_schema_constraints, batch_length, max_workers, additional_context, resolver_params, language_model_params, debug, model_url, extraction_passes, config, model, fetch_urls, prompt_validation_level, prompt_validation_strict)
    288 annotator = annotation.Annotator(
    289     language_model=language_model,
    290     prompt_template=prompt_template,
    291     format_type=format_type,
    292     fence_output=fence_output,
    293 )
    295 if isinstance(text_or_documents, str):
--> 296     return annotator.annotate_text(
    297         text=text_or_documents,
    298         resolver=res,
    299         max_char_buffer=max_char_buffer,
    300         batch_length=batch_length,
    301         additional_context=additional_context,
    302         debug=debug,
    303         extraction_passes=extraction_passes,
    304         max_workers=max_workers,
    305     )
    306 else:
    307     documents = cast(Iterable[data.Document], text_or_documents)

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/annotation.py:506, in Annotator.annotate_text(self, text, resolver, max_char_buffer, batch_length, additional_context, debug, extraction_passes, **kwargs)
    496 start_time = time.time() if debug else None
    498 documents = [
    499     data.Document(
    500         text=text,
   (...)
    503     )
    504 ]
--> 506 annotations = list(
    507     self.annotate_documents(
    508         documents,
    509         resolver,
    510         max_char_buffer,
    511         batch_length,
    512         debug,
    513         extraction_passes,
    514         **kwargs,
    515     )
    516 )
    517 assert (
    518     len(annotations) == 1
    519 ), f"Expected 1 annotation but got {len(annotations)} annotations."
    521 if debug and annotations[0].extractions:

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/annotation.py:236, in Annotator.annotate_documents(self, documents, resolver, max_char_buffer, batch_length, debug, extraction_passes, **kwargs)
    206 """Annotates a sequence of documents with NLP extractions.
    207
    208 Breaks documents into chunks, processes them into prompts and performs
   (...)
    232     ValueError: If there are no scored outputs during inference.
    233 """
    235 if extraction_passes == 1:
--> 236     yield from self._annotate_documents_single_pass(
    237         documents, resolver, max_char_buffer, batch_length, debug, **kwargs
    238     )
    239 else:
    240     yield from self._annotate_documents_sequential_passes(
    241         documents,
    242         resolver,
   (...)
    247         **kwargs,
    248     )

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/annotation.py:356, in Annotator._annotate_documents_single_pass(self, documents, resolver, max_char_buffer, batch_length, debug, **kwargs)
    353 top_inference_result = scored_outputs[0].output
    354 logging.debug("Top inference result: %s", top_inference_result)
--> 356 annotated_chunk_extractions = resolver.resolve(
    357     top_inference_result, debug=debug, **kwargs
    358 )
    359 chunk_text = text_chunk.chunk_text
    360 token_offset = text_chunk.token_interval.start_index

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/resolver.py:233, in Resolver.resolve(self, input_text, suppress_parse_errors, **kwargs)
    229     logging.exception(
    230         "Failed to parse input_text: %s, error: %s", input_text, e
    231     )
    232     return []
--> 233 raise ResolverParsingError("Failed to parse content.") from e
    235 processed_extractions = self.extract_ordered_extractions(extraction_data)
    237 logging.debug("Completed the resolver process.")

ResolverParsingError: Failed to parse content.
```
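The "valid markers" message seems to come from the resolver searching the model output for fence markers before parsing. Since Claude via Bedrock may not wrap its answer in code fences, my guess (untested) is that the `fence_output=False` line I commented out above is exactly what's needed here:

```python
result = lx.extract(
    text_or_documents=sample_input_text,
    model_id="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
    prompt_description=prompt,
    examples=examples,
    fence_output=False,  # guess: stop the resolver from expecting fence markers
)
```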