Condenser
Regarding the spans in the contrastive loss calculation
Hello,
In the paper it is stated that
> ... given a random list of n documents [d1, d2, ..., dn], we extract randomly from each a pair of spans, [s11, s12, ..., sn1, sn2].
I was wondering how the spans are extracted from a document. Are they sentences, split with `nltk.sent_tokenize`? Or are they equally sized chunks extracted with a sliding window? Or maybe they are the same as the Condenser pretraining blocks, annotated with the id of the document they belong to?
Thank you.
I used random non-overlapping sequences. Technically, what is desired here is a good passage tokenizer; you may get better performance if you can do a better job of separating out the passages.
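To illustrate the idea, here is a minimal sketch of sampling two random non-overlapping spans from a tokenized document. This is not the repo's actual implementation; the function name `sample_span_pair`, the fixed `span_len`, and the assumption that `tokens` is a flat list of token ids are all mine.

```python
import random

def sample_span_pair(tokens, span_len=128):
    """Sample two non-overlapping spans of `span_len` tokens from one document.

    Hypothetical sketch: `tokens` is assumed to be a list of token ids
    for a single document.
    """
    if len(tokens) < 2 * span_len:
        return None  # document too short to yield two disjoint spans
    # pick the start of the earlier span, then a second start that cannot overlap it
    first = random.randrange(0, len(tokens) - 2 * span_len + 1)
    second = random.randrange(first + span_len, len(tokens) - span_len + 1)
    s1 = tokens[first:first + span_len]
    s2 = tokens[second:second + span_len]
    # randomly swap so neither span is systematically the earlier one
    return (s1, s2) if random.random() < 0.5 else (s2, s1)
```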
@luyug I'm wondering how long these spans are. From what I understand, you use $MAX_LENGTH in the scripts to set the length. Can you share the values you used when training the models?
The length should be selected to align roughly with the text lengths in your actual search task (rounded according to your accelerator's requirements). For passage retrieval, we used 128.
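As a small illustration of the rounding step, here is a sketch that rounds a target span length up to an accelerator-friendly multiple; the helper name and the choice of 8 as the multiple are my own assumptions, not from the repo.

```python
def round_up(length, multiple=8):
    # Round a target span length up to the nearest multiple
    # (multiples of 8 tend to be friendly for modern accelerators).
    return ((length + multiple - 1) // multiple) * multiple

print(round_up(120))  # 120 (already a multiple of 8)
print(round_up(125))  # 128
```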