graphrag
Fix/few shot selection
Description
This PR enforces a minimum of 3 few-shot examples in the final entity and relation extraction prompts. It also adds a content-based, k-nearest-neighbor (KNN) selection method for few-shot examples, which avoids edge cases where random sampling picks noisy, non-representative examples.
Proposed Changes
- Enforced a minimum of 3 examples in generator/entity_extraction_prompt.py
- Added a new sampling method in loader/input.py
- Renamed some objects to reflect the change from finetune -> prompt_tune
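For reviewers, here is a minimal sketch of how a content-based KNN selection combined with the 3-example floor could look. It is not the code in this PR: the function name `select_nearest_to_centroid`, the constant `MIN_EXAMPLES`, and the choice of the embedding centroid as the query point are all assumptions made for illustration.

```python
import math
from typing import Sequence

# Hypothetical constant mirroring the minimum described in this PR.
MIN_EXAMPLES = 3


def select_nearest_to_centroid(
    embeddings: Sequence[Sequence[float]], k: int
) -> list[int]:
    """Return indices of the k embeddings closest to the centroid.

    Assumption: "representative" means "close to the mean embedding".
    """
    if not embeddings:
        return []
    # Never select fewer than the enforced minimum.
    k = max(k, MIN_EXAMPLES)
    n = len(embeddings)
    dim = len(embeddings[0])
    # Mean vector of all embeddings.
    centroid = [sum(vec[d] for vec in embeddings) / n for d in range(dim)]
    # Euclidean distance of each embedding to the centroid.
    distances = [math.dist(vec, centroid) for vec in embeddings]
    # Sort indices by distance and keep the k nearest.
    ranked = sorted(range(n), key=lambda i: distances[i])
    return ranked[:k]
```

With this approach an outlier chunk far from the bulk of the corpus is ranked last, so it is only selected when `k` approaches the number of available chunks.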
Additional Notes
Research required for additional improvements:
If we suspect the embedding space contains several distinct distributions, we could fit clusters over it and sample from each cluster separately. Doing so would require strong assumptions about what constitutes a "topic" and about the geometry of topics in embedding space, which makes this a challenging problem.
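A rough sketch of the cluster-then-sample idea, for discussion only. It assumes plain k-means with Euclidean distance and a cluster count fixed in advance, which are exactly the kind of strong assumptions noted above. The names `kmeans_labels` and `sample_per_cluster` are hypothetical and do not exist in the codebase.

```python
import math
import random
from typing import Sequence


def kmeans_labels(
    points: Sequence[Sequence[float]],
    n_clusters: int,
    n_iters: int = 20,
    seed: int = 0,
) -> list[int]:
    """Assign each point to a cluster using a basic k-means loop."""
    rng = random.Random(seed)
    # Initialise centers from randomly chosen input points.
    centers = [list(p) for p in rng.sample(list(points), n_clusters)]
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(n_clusters), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sample_per_cluster(
    labels: Sequence[int], per_cluster: int, seed: int = 0
) -> list[int]:
    """Draw up to `per_cluster` point indices from every cluster."""
    rng = random.Random(seed)
    by_cluster: dict[int, list[int]] = {}
    for idx, lab in enumerate(labels):
        by_cluster.setdefault(lab, []).append(idx)
    chosen: list[int] = []
    for lab in sorted(by_cluster):
        members = by_cluster[lab]
        chosen.extend(rng.sample(members, min(per_cluster, len(members))))
    return chosen
```

Usage would be `sample_per_cluster(kmeans_labels(embeddings, n_clusters), per_cluster)`. How to choose `n_clusters`, and whether Euclidean k-means matches the shape of topics in the embedding space, remain open questions.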