Add cross-chunk context awareness to prevent information loss during document chunking
Description:
⚠️ AI-Generated Issue Disclaimer: This issue was identified and generated using generative AI tools. The problem analysis and proposed solutions have not been manually tested or verified. Please validate the issue description and proposed solutions before implementation.
Problem Statement
The current _annotate_documents_single_pass implementation processes document chunks independently without considering context from previous chunks. This leads to significant information loss and extraction quality degradation, particularly for:
- Coreference resolution (pronouns like "she", "he", "it")
- Entity disambiguation (partial names in later chunks)
- Cross-chunk relationships (entities and relationships spanning multiple chunks)
- Context-dependent extractions (entities that only make sense with full context)
Current Behavior
# Each chunk is processed in isolation
for text_chunk in batch:
    batch_prompts.append(
        self._prompt_generator.render(
            question=text_chunk.chunk_text,  # only the current chunk
            additional_context=text_chunk.additional_context,  # only doc-level context
        )
    )
Example Problem
Document: "Dr. Sarah Johnson is a cardiologist at Mayo Clinic. She specializes in heart surgery. Dr. Johnson has 15 years of experience."
Chunk 1: "Dr. Sarah Johnson is a cardiologist at Mayo Clinic."
- Extracts:
{"name": "Dr. Sarah Johnson", "profession": "cardiologist", "hospital": "Mayo Clinic"}
Chunk 2: "She specializes in heart surgery. Dr. Johnson has 15 years of experience."
- Extracts:
{"specialization": "heart surgery", "experience": "15 years"}
- Lost: the connection between "She"/"Dr. Johnson" and "Dr. Sarah Johnson"
Proposed Solutions
Option 1: Sliding Window Context
Option 2: Entity Tracking
Option 3: Overlapping Chunks
Option 4: Post-Processing Coreference Resolution
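As a starting point for Option 1, here is a minimal sketch of sliding-window context. It carries the text of the last few chunks into each prompt's additional context. The `TextChunk` stand-in, the `window_size` parameter, and the "Previous chunks:" framing are assumptions for illustration, not part of the current API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TextChunk:
    """Minimal stand-in for the real chunk type (assumed shape)."""
    chunk_text: str
    additional_context: str = ""


def build_prompts_with_window(batch, render, window_size=2):
    """Render one prompt per chunk, prepending up to `window_size`
    preceding chunks so coreferences have an antecedent in context."""
    prompts = []
    window = deque(maxlen=window_size)  # only the most recent chunks are kept
    for chunk in batch:
        context = chunk.additional_context
        if window:
            context += "\n\nPrevious chunks:\n" + "\n".join(window)
        prompts.append(
            render(question=chunk.chunk_text, additional_context=context.strip())
        )
        window.append(chunk.chunk_text)
    return prompts
```

The trade-off is prompt size: a larger window improves coreference coverage but grows every prompt roughly linearly with `window_size`.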
@aksg87
+1 to this issue.
For example, when extracting knowledge from Wikipedia articles (an easy source of real-world data that LLMs have been well trained on), coreferences are lost across chunks. An easy test is to have the model assign a "unique" id attribute ... the ids will be reused across chunks.

The workflow we have used in this space is pre-processing coreferences with a sliding window. It is very rare for a forward reference to cross chunks, so if you feed in the information from the prior chunks in order, the coreferences are resolved correctly. This should cover first-, second-, and third-person pronouns, both singular and plural (I, you, he/she, we, you plural, they), as well as place (here, there), time (now, then), and object (it) references.
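The pre-processing workflow described above might be sketched as follows. Each chunk is rewritten before extraction, using a sliding window of the already-resolved prior chunks as context; `resolve` stands in for whatever LLM call is available, and the prompt wording is an assumption, not a tested template.

```python
def resolve_coreferences(chunks, resolve, window_size=2):
    """Rewrite each chunk with pronouns/deictic references replaced by their
    antecedents, given a window of previously resolved chunks.

    `resolve` is any callable that takes a prompt string and returns the
    rewritten text (e.g. an LLM call) -- hypothetical, for illustration."""
    resolved = []
    for chunk in chunks:
        context = "\n".join(resolved[-window_size:])  # prior chunks, in order
        prompt = (
            "Rewrite the passage so every pronoun or deictic reference "
            "(he, she, it, they, here, there, now, then, ...) is replaced "
            "by the entity or place it refers to, using the preceding text.\n\n"
            f"Preceding text:\n{context}\n\nPassage:\n{chunk}"
        )
        resolved.append(resolve(prompt))
    return resolved
```

Because resolved (not raw) chunks feed the window, an antecedent introduced many chunks earlier can still propagate forward as long as each intermediate chunk restates it.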