Fix/few shot selection

Open j2whiting opened this issue 1 year ago • 0 comments

Description

This PR enforces a minimum of 3 few-shot examples in the final entity and relation extraction prompts. It also adds a content-based, k-nearest-neighbor (KNN) selection method for few-shot examples, which avoids edge cases where random sampling picks noisy, non-representative examples.

Proposed Changes

Enforced a minimum of 3 examples in generator/entity_extraction_prompt.py
Added a new sampling method in loader/input.py
Renamed some objects to reflect the change from finetune to prompt_tune
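
For reviewers who want the idea without reading the diff, here is a minimal sketch of content-based KNN selection combined with the 3-example floor. It is not the code in loader/input.py: the function name `select_knn_examples`, the constant `MIN_EXAMPLES`, and the choice of the embedding centroid as the query point are all assumptions made for illustration.

```python
import math

MIN_EXAMPLES = 3  # floor described in this PR; the constant name is hypothetical


def select_knn_examples(embeddings, k):
    """Return indices of the k embeddings closest to the centroid.

    Hypothetical helper: assumes "content-based" means ranking chunks by
    Euclidean distance to the mean embedding of all chunks.
    """
    if not embeddings:
        return []
    # Never go below the floor, and never ask for more items than exist.
    k = min(max(k, MIN_EXAMPLES), len(embeddings))
    dim = len(embeddings[0])
    centroid = [sum(vec[d] for vec in embeddings) / len(embeddings) for d in range(dim)]
    # Rank chunk indices from nearest to farthest from the centroid.
    ranked = sorted(range(len(embeddings)), key=lambda i: math.dist(embeddings[i], centroid))
    return ranked[:k]
```

With this shape, a request for a single example still yields three, and an outlier chunk far from the bulk of the corpus is the last to be picked rather than equally likely as under random sampling.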

Additional Notes

Research required for additional improvements:

We could, in theory, fit clusters over the embedding space and sample from each cluster separately, if we suspect the space contains distinct distributions worth sampling from independently. This would require strong assumptions about what constitutes a "topic" and about the geometry of topics in embedding space, which makes it a challenging problem.
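
To make the cluster idea concrete, a rough sketch follows. It is not part of this PR: it swaps in a plain k-means (Lloyd's algorithm) as the clustering step, which already assumes roughly spherical topics, and the names `kmeans_labels` and `sample_per_cluster` are hypothetical.

```python
import math
import random


def kmeans_labels(points, n_clusters, n_iters=20, seed=0):
    """Assign each point to a cluster with a basic Lloyd's-algorithm loop."""
    rng = random.Random(seed)
    # Initialise centers from randomly chosen input points.
    centers = [list(p) for p in rng.sample(points, n_clusters)]
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(n_clusters), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sample_per_cluster(points, n_clusters, per_cluster, seed=0):
    """Draw up to `per_cluster` point indices at random from every cluster."""
    labels = kmeans_labels(points, n_clusters, seed=seed)
    rng = random.Random(seed)
    chosen = []
    for c in range(n_clusters):
        members = [i for i, lab in enumerate(labels) if lab == c]
        chosen.extend(rng.sample(members, min(per_cluster, len(members))))
    return chosen
```

The open questions from the note above remain: how to pick `n_clusters`, and whether Euclidean k-means clusters correspond to anything a reader would call a topic.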

Jul 12 '24 15:07 j2whiting