
hard negatives

Open fpservant opened this issue 1 year ago • 2 comments

Hi, could I get some information about hard negative mining? I'm a bit confused by the idea. Doesn't it assume that the training set contains a complete list of the positive answers to each query? (It may be difficult to draw the line between positive and negative results, so there is a risk of confusing 'hard negative' with 'not positive enough'.) TIA

fpservant avatar Jan 31 '24 17:01 fpservant

Yes, this is correct: all datasets used for retrieval training tend to contain many more positives than are actually annotated.

There are a bunch of different approaches commonly used to mitigate this. The one used in RAGatouille at the moment is a fairly basic and widespread one:

  • We enforce a min_rank when mining, so none of the top n results are considered as potential hard negatives. n is set to 10 here, somewhat arbitrarily, following common practice (e.g. the bge embeddings recipes).
  • We set a max_rank, in our case a naive min(110, len(collection) // 10).
  • We retrieve the max_rank most similar documents, discard any ranked within the top min_rank, and then randomly sample k (by default, 10) hard negatives from the remaining results.
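The steps above can be sketched roughly as follows. This is an illustrative sketch, not RAGatouille's actual code; the function name, arguments, and the assumption of pre-normalized dense embeddings are all mine:

```python
import random
import numpy as np

def mine_hard_negatives(query_emb, collection_embs, positive_ids,
                        min_rank=10, k=10, seed=0):
    """Sample k hard negatives for one query, skipping the top-min_rank hits.

    query_emb: (d,) L2-normalized query embedding (hypothetical input)
    collection_embs: (N, d) L2-normalized document embeddings
    positive_ids: indices of known positives, never returned as negatives
    """
    n_docs = len(collection_embs)
    # Naive depth cap, as described above.
    max_rank = min(110, n_docs // 10)

    # Rank documents by cosine similarity (dot product of normalized vectors).
    scores = collection_embs @ query_emb
    ranked = np.argsort(-scores)

    # Keep only ranks in [min_rank, max_rank), dropping known positives.
    positives = set(positive_ids)
    window = [int(i) for i in ranked[min_rank:max_rank] if int(i) not in positives]

    # Randomly sample k hard negatives from the remaining window.
    rng = random.Random(seed)
    return rng.sample(window, min(k, len(window)))
```

The min_rank cutoff is what guards against false negatives: anything similar enough to land in the top n is assumed to be a plausible (unannotated) positive and is excluded.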

This doesn't completely eliminate the possibility of false negatives in our mining, but it drastically reduces it, and it works well empirically. If you know your data better and have a good way to avoid false negatives, an alternate mining approach could work better, though!

bclavie avatar Jan 31 '24 19:01 bclavie

Thanks a lot! Very clear explanation

fpservant avatar Jan 31 '24 19:01 fpservant