
Hard negative mining vs. random sampling

Open vahuja4 opened this issue 1 year ago • 2 comments

Has anyone tried hard negative mining when generating the sentence pairs, as opposed to random sampling? @tomaarsen - is random sampling the default?

vahuja4 · Apr 12 '23 09:04
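For reference, a minimal sketch of what hard negative mining could look like here: instead of pairing a sentence with a random differently-labeled sentence, pair it with the differently-labeled sentences that sit closest in the base model's embedding space. The function name, default model, and the `(sentence_a, sentence_b, 0.0)` pair format are illustrative assumptions, not SetFit internals.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim


def mine_hard_negative_pairs(sentences, labels,
                             model_name="sentence-transformers/paraphrase-mpnet-base-v2",
                             k=1):
    """Pair each sentence with its k most similar differently-labeled
    sentences ("hard" negatives) instead of random ones."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences, convert_to_tensor=True)
    sims = cos_sim(embeddings, embeddings).cpu().numpy()
    labels = np.asarray(labels)
    pairs = []
    for i in range(len(sentences)):
        # Mask out same-label sentences (including self) so only negatives remain.
        candidate_sims = np.where(labels != labels[i], sims[i], -np.inf)
        for j in np.argsort(candidate_sims)[::-1][:k]:
            if np.isfinite(candidate_sims[j]):
                # 0.0 marks a negative pair (cosine-similarity-style target).
                pairs.append((sentences[i], sentences[j], 0.0))
    return pairs
```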

Random sampling for the negative pairs is the default, yes. My understanding is that this is a relatively hard-to-beat baseline. @danielkorat has done some research on different sampling approaches, and I believe he found that some seemingly clever sampling approaches were beaten by simple random sampling. However, I think he also found that there are some improvements to be made over purely random sampling.

I don't recall exactly whether he tried finding hard negatives, but perhaps he can elaborate a bit himself, if he finds the time.

  • Tom Aarsen

tomaarsen · Apr 12 '23 09:04
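As a point of comparison, here is a minimal sketch of the random baseline described above. It mirrors the idea, not SetFit's exact internals; the function name and pair format are illustrative.

```python
import random


def random_negative_pairs(sentences, labels, num_pairs, seed=42):
    """Draw two sentences uniformly at random and keep the draw when their
    labels differ, yielding purely random negative pairs."""
    rng = random.Random(seed)
    pairs = []
    # Assumes at least two distinct labels exist, otherwise this never terminates.
    while len(pairs) < num_pairs:
        a, b = rng.sample(range(len(sentences)), 2)
        if labels[a] != labels[b]:
            pairs.append((sentences[a], sentences[b], 0.0))
    return pairs
```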

I was wondering something similar. I have an n-class case where some of the classes will likely already be well separated in the un-tuned embedding space. It would be nice to bias sampling towards the pairs where I know a priori there is likely to be confusion in the downstream classification task.

adfindlater · Apr 15 '23 16:04
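One hypothetical way to implement that bias, sketched below: weight negative sampling by the cosine similarity of class centroids under the un-tuned model, so that classes likely to be confused contribute more pairs. Everything here (function name, exponential weighting scheme, model choice) is an assumption for illustration, not part of SetFit.

```python
import numpy as np
from sentence_transformers import SentenceTransformer


def confusion_weighted_negative_pairs(sentences, labels, num_pairs,
                                      model_name="sentence-transformers/paraphrase-mpnet-base-v2",
                                      seed=0):
    """Sample negative pairs with probability proportional to how close the
    two classes' centroids are in the un-tuned embedding space."""
    rng = np.random.default_rng(seed)
    emb = SentenceTransformer(model_name).encode(sentences)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Class centroids in the un-tuned space, normalized for cosine similarity.
    centroids = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = centroids @ centroids.T
    # Upper-triangle similarities, in the same row-major order as class_pairs.
    class_pairs = [(i, j) for i in range(len(classes))
                   for j in range(i + 1, len(classes))]
    weights = np.exp(sim[np.triu_indices(len(classes), k=1)])
    weights /= weights.sum()
    pairs = []
    for _ in range(num_pairs):
        # Pick a class pair, biased toward confusable (similar) classes...
        ci, cj = class_pairs[rng.choice(len(class_pairs), p=weights)]
        # ...then pick one random member of each class.
        a = rng.choice(np.flatnonzero(labels == classes[ci]))
        b = rng.choice(np.flatnonzero(labels == classes[cj]))
        pairs.append((sentences[a], sentences[b], 0.0))
    return pairs
```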