
BUG: Failed to parse content from a large file (ResolverParsingError)

Open franciscobmacedo opened this issue 4 months ago • 5 comments

I tried this with a relatively large file (27,197 characters) and it failed.

Here's how I called the extract function:

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    batch_length=20,
    extraction_passes=3,      # Multiple passes for improved recall
    max_workers=20,           # Parallel processing for speed
    max_char_buffer=1000      # Smaller contexts for better accuracy
)

and the result:

File ".../langextract/__init__.py", line 226, in extract
    return annotator.annotate_text(
           ~~~~~~~~~~~~~~~~~~~~~~~^
        text=text_or_documents,
        ^^^^^^^^^^^^^^^^^^^^^^^
    ...<5 lines>...
        extraction_passes=extraction_passes,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File ".../langextract/annotation.py", line 505, in annotate_text
    annotations = list(
        self.annotate_documents(
    ...<7 lines>...
        )
    )
  File ".../langextract/annotation.py", line 239, in annotate_documents
    yield from self._annotate_documents_sequential_passes(
    ...<7 lines>...
    )
  File ".../langextract/annotation.py", line 419, in _annotate_documents_sequential_passes
    for annotated_doc in self._annotate_documents_single_pass(
                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        document_list,
        ^^^^^^^^^^^^^^
    ...<4 lines>...
        **kwargs,  # Only show progress on first pass
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ):
    ^
  File ".../langextract/annotation.py", line 355, in _annotate_documents_single_pass
    annotated_chunk_extractions = resolver.resolve(
        top_inference_result, debug=debug, **kwargs
    )
  File ".../langextract/resolver.py", line 230, in resolve
    raise ResolverParsingError("Failed to parse content.") from e
langextract.resolver.ResolverParsingError: Failed to parse content.

franciscobmacedo avatar Jul 31 '25 09:07 franciscobmacedo

Hi, with a large file there may have been some unicode characters or something else that broke the parser.

I would suggest looking at the logs to identify the input that fails. Also, here is a demo using LX where the input text is sanitized: https://huggingface.co/spaces/google/radextract/blob/main/sanitize.py. This should be incorporated directly into the library in the future.
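For reference, here's a rough sketch of that kind of pre-processing. This is just an illustration of the idea, not the actual radextract code: it normalizes unicode and strips control characters that could confuse the parser.

import unicodedata

def sanitize_text(text: str) -> str:
    # Illustrative sanitizer (not the radextract implementation):
    # normalize compatibility characters (smart quotes, ligatures, etc.)
    # and drop non-printable control characters, keeping newlines/tabs.
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )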

If you identify the culprit text, please share that so the resolver can be reviewed against this.
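One way to find it: the resolver logs the raw model output at DEBUG level right before parsing (the logging.debug("Top inference result: %s", ...) call visible in the tracebacks below), so raising verbosity should reveal the exact chunk that fails. A rough sketch, assuming absl logging, which the library uses for its messages:

from absl import logging as absl_logging
from langextract.resolver import ResolverParsingError
import langextract as lx

# Surface the resolver's DEBUG output, including the raw model response
# that precedes each parse attempt.
absl_logging.set_verbosity(absl_logging.DEBUG)

try:
    result = lx.extract(
        text_or_documents=input_text,  # same inputs as in the original report
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash",
        debug=True,
    )
except ResolverParsingError:
    # The last "Top inference result: ..." DEBUG entry is the model output
    # that failed to parse -- that is the culprit text worth sharing here.
    raise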

aksg87 avatar Aug 01 '25 06:08 aksg87

I'm receiving the same error on some texts, while on others it works perfectly. Without the sanitize function the parser fails with this message:

LangExtract: Processing, current=8,246 chars, processed=8,246 chars: [00:00]ERROR:absl:Content does not contain 'extractions' key

But after applying the function it still fails after some time with this message:

LangExtract: Processing, current=5,676 chars, processed=23,626 chars: [00:50]ERROR:absl:Content does not contain 'extractions' key.

It seems like https://huggingface.co/spaces/google/radextract/blob/main/sanitize.py is not a perfect solution.

AliHaider20 avatar Aug 13 '25 09:08 AliHaider20

I'm receiving the same error while using models other than Gemini 2.5 Flash or Pro, mostly while processing the second chunk batch:

ERROR:absl:Content does not contain 'extractions' key.

pritkudale avatar Aug 13 '25 12:08 pritkudale

Using an English-language source text of 16,895 characters, for some prompts I get this exception with both the gemma2:2b and gemma3:27b models running locally under Ollama. However, for more generic variations of the same prompt on the exact same source text, the library runs through and produces results, albeit overly generic ones.

mwkuster avatar Aug 22 '25 09:08 mwkuster

I also get a similar error:

ERROR:absl:Input string does not contain valid markers.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/resolver.py:224, in Resolver.resolve(self, input_text, suppress_parse_errors, **kwargs)
    223 try:
--> 224   extraction_data = self.string_to_extraction_data(input_text)
    225   logging.debug("Parsed content: %s", extraction_data)

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/resolver.py:383, in Resolver.string_to_extraction_data(self, input_string)
    365 """Parses a YAML or JSON-formatted string into extraction data.
    366 
    367 This function extracts data from a string containing YAML or JSON content.
   (...)    381     ValueError: If the input is invalid or does not contain expected format.
    382 """
--> 383 parsed_data = self._extract_and_parse_content(input_string)
    385 if not isinstance(parsed_data, dict):

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/resolver.py:342, in Resolver._extract_and_parse_content(self, input_string)
    341   logging.error("Input string does not contain valid markers.")
--> 342   raise ValueError("Input string does not contain valid markers.")
    344 content = input_string[left + prefix_length : right].strip()

ValueError: Input string does not contain valid markers.

The above exception was the direct cause of the following exception:

ResolverParsingError                      Traceback (most recent call last)
Cell In[37], line 1
----> 1 result = lx.extract(
      2     text_or_documents=sample_input_text, #, [:3000], #input_text[2000:2125], #dummy_example, 
      3     model_id="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
      4     prompt_description=prompt,
      5     examples=examples,
      6     #batch_length=20,
      7     #fence_output=False,
      8     #extraction_passes=3,
      9     #max_workers=3, 
     10     #max_char_buffer=1000
     11 )

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/__init__.py:55, in extract(*args, **kwargs)
     53 def extract(*args: Any, **kwargs: Any):
     54   """Top-level API: lx.extract(...)."""
---> 55   return extract_func(*args, **kwargs)

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/extraction.py:296, in extract(text_or_documents, prompt_description, examples, model_id, api_key, language_model_type, format_type, max_char_buffer, temperature, fence_output, use_schema_constraints, batch_length, max_workers, additional_context, resolver_params, language_model_params, debug, model_url, extraction_passes, config, model, fetch_urls, prompt_validation_level, prompt_validation_strict)
    288 annotator = annotation.Annotator(
    289     language_model=language_model,
    290     prompt_template=prompt_template,
    291     format_type=format_type,
    292     fence_output=fence_output,
    293 )
    295 if isinstance(text_or_documents, str):
--> 296   return annotator.annotate_text(
    297       text=text_or_documents,
    298       resolver=res,
    299       max_char_buffer=max_char_buffer,
    300       batch_length=batch_length,
    301       additional_context=additional_context,
    302       debug=debug,
    303       extraction_passes=extraction_passes,
    304       max_workers=max_workers,
    305   )
    306 else:
    307   documents = cast(Iterable[data.Document], text_or_documents)

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/annotation.py:506, in Annotator.annotate_text(self, text, resolver, max_char_buffer, batch_length, additional_context, debug, extraction_passes, **kwargs)
    496 start_time = time.time() if debug else None
    498 documents = [
    499     data.Document(
    500         text=text,
   (...)    503     )
    504 ]
--> 506 annotations = list(
    507     self.annotate_documents(
    508         documents,
    509         resolver,
    510         max_char_buffer,
    511         batch_length,
    512         debug,
    513         extraction_passes,
    514         **kwargs,
    515     )
    516 )
    517 assert (
    518     len(annotations) == 1
    519 ), f"Expected 1 annotation but got {len(annotations)} annotations."
    521 if debug and annotations[0].extractions:

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/annotation.py:236, in Annotator.annotate_documents(self, documents, resolver, max_char_buffer, batch_length, debug, extraction_passes, **kwargs)
    206 """Annotates a sequence of documents with NLP extractions.
    207 
    208   Breaks documents into chunks, processes them into prompts and performs
   (...)    232   ValueError: If there are no scored outputs during inference.
    233 """
    235 if extraction_passes == 1:
--> 236   yield from self._annotate_documents_single_pass(
    237       documents, resolver, max_char_buffer, batch_length, debug, **kwargs
    238   )
    239 else:
    240   yield from self._annotate_documents_sequential_passes(
    241       documents,
    242       resolver,
   (...)    247       **kwargs,
    248   )

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/annotation.py:356, in Annotator._annotate_documents_single_pass(self, documents, resolver, max_char_buffer, batch_length, debug, **kwargs)
    353 top_inference_result = scored_outputs[0].output
    354 logging.debug("Top inference result: %s", top_inference_result)
--> 356 annotated_chunk_extractions = resolver.resolve(
    357     top_inference_result, debug=debug, **kwargs
    358 )
    359 chunk_text = text_chunk.chunk_text
    360 token_offset = text_chunk.token_interval.start_index

File ~/be.axa.data.exploration.pci/clio/notebooks/test_google_langextract/.langextract_env/lib/python3.11/site-packages/langextract/resolver.py:233, in Resolver.resolve(self, input_text, suppress_parse_errors, **kwargs)
    229     logging.exception(
    230         "Failed to parse input_text: %s, error: %s", input_text, e
    231     )
    232     return []
--> 233   raise ResolverParsingError("Failed to parse content.") from e
    235 processed_extractions = self.extract_ordered_extractions(extraction_data)
    237 logging.debug("Completed the resolver process.")

ResolverParsingError: Failed to parse content.
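The "Input string does not contain valid markers" message comes from _extract_and_parse_content (resolver.py:342 above), which looks for markers around the model output before parsing it. Since Claude via Bedrock may not wrap its answer in fences, my next guess — just a guess based on the fence_output parameter visible in the extract signature above, not a confirmed fix — is to disable fenced output:

result = lx.extract(
    text_or_documents=sample_input_text,
    model_id="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0",
    prompt_description=prompt,
    examples=examples,
    fence_output=False,  # assumption: don't expect fence markers around the output
)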

ivankeller avatar Sep 22 '25 09:09 ivankeller