LangExtract extracts content from examples instead of actual input text

Open · ojipadeson opened this issue 4 months ago • 1 comment

Description

I'm experiencing an issue where LangExtract is extracting content from the provided examples rather than from the actual input text I want to process. This results in duplicate extractions and incorrect data being returned.

Expected Behavior

LangExtract should only extract information from the input text/document, using examples solely as guidance for the extraction format and structure.

Actual Behavior

LangExtract is generating extractions for both the examples and the input text, resulting in:

  • Content from examples appearing in the final results

Questions

  1. Is this the intended behavior? Should examples generate their own extractions?
  2. How can I ensure only input text extractions are returned?
  3. Is there a parameter to disable example extractions?
  4. What do group_index and extraction_index represent in this context?

ojipadeson · Aug 29 '25 03:08

I ran into the same problem: I built examples from document A and reused the prompt with those examples on document B, and the results include content from the document A examples that is not present in document B.

The issue causes false positives.

It seems the problem occurs when multiple examples are provided (none of the content from A is present in B), and it occurs even with temperature = 0.0 (no determinism). If only one example is provided (again with nothing from A present in B), the model extracts better, but how can I control for false positives from the examples?

gg4u · Sep 16 '25 15:09