
Training data for Cross-Encoder model

Open shzamanirad opened this issue 3 years ago • 2 comments

Hi,

It is mentioned in the paper that:

We train our cross-encoder model based on the top 100 retrieved results from our bi-encoder model on Wikipedia data. For the training of the cross-encoder model, we further down-sample our training data to obtain a training set of 1M examples.

Can you please provide this training data for cross-encoder?

Thanks.

shzamanirad avatar Aug 18 '21 03:08 shzamanirad

@shzamanirad: The training data for the cross-encoder is output by the eval_biencoder.py script. For every datapoint in the train/test/valid split, the eval script basically outputs the top 64 retrieved candidates and calculates recall at various positions (recall@1, @10, @64, etc.).
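
For reference, here is a minimal sketch of how recall@k can be computed from the retrieved candidate lists; the function and variable names are illustrative, not taken from the BLINK codebase:

```python
# Minimal sketch: recall@k over bi-encoder retrieval results.
# `retrieved` maps each mention to its ranked list of candidate entity ids,
# `gold` maps each mention to its gold entity id (names are illustrative).
def recall_at_k(retrieved, gold, k):
    hits = sum(
        1 for mention, candidates in retrieved.items()
        if gold[mention] in candidates[:k]
    )
    return hits / len(retrieved)

# Example: the positions reported by the eval script.
# for k in (1, 10, 64):
#     print(k, recall_at_k(retrieved, gold, k))
```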

The top 64 retrieved candidates are then used by the cross-encoder as train/test/valid data.
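
As a rough sketch (not the actual output format of eval_biencoder.py, just an assumed structure for illustration), the retrieved candidates could be packaged into cross-encoder training examples like this:

```python
# Sketch: turn top-64 bi-encoder candidates into cross-encoder examples.
# Each example pairs the mention context with its candidate list plus the
# index of the gold entity among the candidates (-1 if it was not retrieved).
def build_crossencoder_examples(mentions):
    examples = []
    for m in mentions:  # m: {"context": ..., "candidates": [...], "gold_id": ...}
        candidate_ids = [c["entity_id"] for c in m["candidates"][:64]]
        label = candidate_ids.index(m["gold_id"]) if m["gold_id"] in candidate_ids else -1
        examples.append({
            "context": m["context"],
            "candidates": candidate_ids,
            "label": label,
        })
    return examples
```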

abhinavkulkarni avatar May 13 '22 17:05 abhinavkulkarni

Hi, I have a doubt: if we train the cross-encoder model with 64 candidates, can we evaluate the model with 20 or 30 candidates?

atulbunkar avatar Jun 02 '23 09:06 atulbunkar