
Regarding the spans in the contrastive loss calculation

Open hyleemindslab opened this issue 3 years ago • 3 comments

Hello,

In the paper it is stated that

> ... given a random list of $n$ documents $[d_1, d_2, \dots, d_n]$, we extract randomly from each a pair of spans, $[s_{11}, s_{12}, \dots, s_{n1}, s_{n2}]$.

I was wondering how the spans were extracted from a document. Are they sentences, split off with nltk.sentence_tokenizer? Are they equally sized chunks extracted with a sliding window? Or are they the same as the Condenser pretraining blocks, just annotated with the id of the document they belong to?

Thank you.

hyleemindslab avatar Nov 17 '21 09:11 hyleemindslab

I used random non-overlapping sequences. Technically, what is desired here is a good passage tokenizer; you may get better performance if you can do a better job of separating out the passages.
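For illustration only, here is a minimal sketch of one way to draw such a non-overlapping pair from a tokenized document. This is not the exact code from this repo; the function name, the `span_len` default, and the sampling scheme are just placeholders:

```python
import random

def sample_span_pair(token_ids, span_len=128, rng=random):
    """Draw two random, non-overlapping spans of `span_len` tokens
    from a single document's flat list of token ids."""
    n = len(token_ids)
    if n < 2 * span_len:
        return None  # document too short for two disjoint spans
    if n == 2 * span_len:
        return token_ids[:span_len], token_ids[span_len:]
    # Sample two sorted offsets, then shift the second offset past
    # the first span so the two spans can never overlap.
    a, b = sorted(rng.sample(range(n - 2 * span_len + 1), 2))
    return (token_ids[a : a + span_len],
            token_ids[b + span_len : b + 2 * span_len])
```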

luyug avatar Nov 21 '21 14:11 luyug

@luyug I'm wondering how long these spans are. From what I understand, you were using $MAX_LENGTH in the scripts to set the length. Can you share the values you used when training the models?

eugene-yang avatar Dec 09 '21 22:12 eugene-yang

The length should be selected to align roughly with the text lengths in your actual search task (rounded according to your accelerator's requirements). For passage retrieval, we used 128.
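As a hypothetical illustration of the rounding (the helper name and the multiple of 8 are assumptions, e.g. for fp16 tensor cores):

```python
def round_up(length: int, multiple: int = 8) -> int:
    """Round a target span length up to the nearest multiple
    the accelerator handles efficiently (assumed 8 here)."""
    return -(-length // multiple) * multiple

print(round_up(125))  # -> 128
```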

luyug avatar Dec 10 '21 02:12 luyug