
Regarding the spans in the contrastive loss calculation

Open hyleemindslab opened this issue 3 years ago • 3 comments

Hello,

In the paper it is stated that

> ... given a random list of $n$ documents $[d_1, d_2, \dots, d_n]$, we extract randomly from each a pair of spans, $[s_{11}, s_{12}, \dots, s_{n1}, s_{n2}]$.

I was wondering how the spans were extracted from a document. Are they sentences, split off with nltk.sentence_tokenizer? Are they equally sized chunks extracted with a sliding window? Or are they the same as the Condenser pretraining blocks, just annotated with the id of the document they belong to?

Thank you.

hyleemindslab avatar Nov 17 '21 09:11 hyleemindslab

I used random non-overlapping sequences. Technically, what is desired here is a good passage tokenizer; you may get better performance if you can do a better job of separating out the passages.
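For illustration only, here is a minimal sketch of one way to draw such a non-overlapping pair from a tokenized document. This is not the exact code from this repo; the function name, the `span_len` default, and the sampling scheme are just placeholders:

```python
import random

def sample_span_pair(token_ids, span_len=128, rng=random):
    """Draw two random, non-overlapping spans of `span_len` tokens
    from a single document's flat list of token ids."""
    n = len(token_ids)
    if n < 2 * span_len:
        return None  # document too short for two disjoint spans
    if n == 2 * span_len:
        return token_ids[:span_len], token_ids[span_len:]
    # Sample two sorted offsets, then shift the second offset past
    # the first span so the two spans can never overlap.
    a, b = sorted(rng.sample(range(n - 2 * span_len + 1), 2))
    return (token_ids[a : a + span_len],
            token_ids[b + span_len : b + 2 * span_len])
```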

luyug avatar Nov 21 '21 14:11 luyug

@luyug I'm wondering how long these spans are. From what I understand, you were using $MAX_LENGTH in the scripts to set the length. Can you share the values you used when training the models?

eugene-yang avatar Dec 09 '21 22:12 eugene-yang

The length should be selected to align roughly with the text lengths in your actual search task (rounded according to your accelerator's requirements). For passage retrieval, we used 128.
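As a hypothetical illustration of the rounding (the helper name and the multiple of 8 are assumptions, e.g. for fp16 tensor cores):

```python
def round_up(length: int, multiple: int = 8) -> int:
    """Round a target span length up to the nearest multiple
    the accelerator handles efficiently (assumed 8 here)."""
    return -(-length // multiple) * multiple

print(round_up(125))  # -> 128
```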

luyug avatar Dec 10 '21 02:12 luyug