LangExtract extracts content from examples instead of actual input text
Description
I'm experiencing an issue where LangExtract is extracting content from the provided examples rather than from the actual input text I want to process. This results in duplicate extractions and incorrect data being returned.
Expected Behavior
LangExtract should only extract information from the input text/document, using examples solely as guidance for the extraction format and structure.
Actual Behavior
LangExtract is generating extractions for both the examples and the input text, resulting in:
- Content from examples appearing in the final results
Questions
- Is this the intended behavior? Should examples generate their own extractions?
- How can I ensure only input text extractions are returned?
- Is there a parameter to disable example extractions?
- What do `group_index` and `extraction_index` represent in this context?
I ran into the same problem: when I build examples from document A and reuse the same prompt and examples on document B, the results include extractions taken from the document A examples that are not present in document B.
The issue causes false positives.
The problem seems to occur when multiple examples are provided (none of the example content from A is present in B), and it occurs even with temperature = 0.0 (no determinism). If only one example is provided (again, none from A present in B), the model extracts better, but how can I control for false positives coming from the examples?
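Until the intended behavior is clarified, one possible workaround is to post-filter the results and keep only extractions whose text actually occurs in the input document. The snippet below is a minimal sketch of that idea, not LangExtract's own mechanism: the `Extraction` dataclass and the `split_grounded` helper are hypothetical names made up for illustration, and the sketch assumes each result exposes its extracted text as a plain string.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    # Hypothetical stand-in for one extraction record (not LangExtract's class).
    extraction_class: str
    extraction_text: str


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not cause a grounded extraction to be rejected.
    return " ".join(text.lower().split())


def split_grounded(extractions, source_text):
    """Split extractions into (kept, suspect).

    kept:    extraction text occurs in source_text after normalization.
    suspect: extraction text does not occur in source_text, e.g. because
             it was copied from a few-shot example.
    """
    source = _normalize(source_text)
    kept, suspect = [], []
    for item in extractions:
        needle = _normalize(item.extraction_text)
        if needle and needle in source:
            kept.append(item)
        else:
            suspect.append(item)
    return kept, suspect


# Illustrative data: "ibuprofen" stands for text that only appeared in an example.
document_b = "Patient was given 250 mg of amoxicillin."
results = [
    Extraction("medication", "amoxicillin"),
    Extraction("medication", "ibuprofen"),
]
kept, suspect = split_grounded(results, document_b)
```

This only catches leaked text that is absent from the input. It will not help if an example happens to share wording with the document, and a strict substring match may also discard extractions that the model paraphrased, so the `suspect` list is worth reviewing rather than silently dropping.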