RAGatouille
hard negatives
Hi, could I get some information about hard negative mining? I am a bit confused by this idea. Doesn't it assume that the training set contains a complete list of the positive answers to each query? (It may be difficult to draw the line between positive and negative results... so there is a risk of confusing 'hard negative' with 'not positive enough'.) TIA
Yes, this is correct: all datasets used for retrieval training tend to have many more positives than are actually annotated.
There's a bunch of different approaches commonly used to mitigate that. The one used in RAGatouille at the moment is a fairly basic & widespread one (sketched in code after the list below):
- We enforce a `min_rank` when mining, so none of the top-`n` results are considered as potential hard negatives. `n` is set to 10 here, arbitrarily / following common practice (bge embeddings).
- We set a `max_rank`, in our case a naive `min(110, int(len(collection) // 10))`.
- We retrieve the `max_rank` most similar documents, discard all the ones with a rank below `min_rank`, and then randomly sample `k` (by default, 10) hard negatives from the remaining results.
This doesn't completely remove the possibility of false negatives in our mining, but it drastically reduces it, and it empirically works well. If you know your data well and have a good way to avoid false negatives, an alternate mining approach could work better, though!
Thanks a lot! Very clear explanation