
Training data for NQ?

Open ReyonRen opened this issue 3 years ago • 6 comments

Thanks for the great contribution!

I found that the downloaded NQ data contains only the test files and the corpus. Where can I get the training files?

Thank you!

ReyonRen avatar Aug 27 '21 09:08 ReyonRen

Hi @ReyonRen,

You can use this dataset for NQ training: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/paraphrases/NQ-train_pairs.jsonl.gz

Kind Regards, Nandan

thakur-nandan avatar Aug 27 '21 09:08 thakur-nandan
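
For readers who want to inspect the file linked above, here is a minimal sketch of reading it after download. It assumes that each line of `NQ-train_pairs.jsonl.gz` is a JSON array whose first two elements are a question and a relevant passage; that layout is an assumption and is not confirmed in this thread, so check the first few lines of the file before relying on it. The function name `read_pairs` is hypothetical.

```python
import gzip
import json
from typing import Iterator, Tuple


def read_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (question, passage) pairs from a gzipped JSON Lines file.

    Assumption: every non-empty line is a JSON array whose first two
    elements are the question and a relevant passage.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if not isinstance(record, list) or len(record) < 2:
                # Fail loudly if the assumed layout does not hold.
                raise ValueError(f"unexpected record layout: {record!r}")
            yield record[0], record[1]
```

Reading lazily with a generator keeps memory use flat regardless of file size; wrap the call in `list(...)` only if the whole set is needed in memory.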

Thank you very much. Are the passages in this training set a subset of the passages in your open-source NQ corpus?

ReyonRen avatar Aug 27 '21 13:08 ReyonRen

Hi @ReyonRen,

Actually, the evaluation corpus is a subset of the training set, because the original NQ dataset often contains duplicated pages, i.e. identical Wikipedia pages from, let's say, 2014, 2015, etc.

While creating the BEIR NQ evaluation corpus, we evaluate only a single question per Wikipedia passage, because adding other passages with the same title but from a different year, let's say 2014 or 2015, would introduce duplicates into the dataset.

However, during training, you do not care about duplicates and train with all passage and question combinations!

Kind Regards, Nandan Thakur

thakur-nandan avatar Aug 30 '21 09:08 thakur-nandan
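
The difference described above can be illustrated with a toy sketch: an evaluation corpus keeps one passage per title, while a training set keeps every snapshot. This is only an illustration of the idea, not BEIR's actual preprocessing; the function name `keep_one_per_title` and the choice to keep the first occurrence are assumptions.

```python
from typing import Iterable, List, Tuple

Passage = Tuple[str, str]  # (title, text)


def keep_one_per_title(passages: Iterable[Passage]) -> List[Passage]:
    """Keep the first passage seen for each title and drop later ones."""
    seen_titles = set()
    kept = []
    for title, text in passages:
        if title in seen_titles:
            continue  # a later snapshot of a page we already have
        seen_titles.add(title)
        kept.append((title, text))
    return kept


# Hypothetical snapshots of the same page from different years.
snapshots = [
    ("Paris", "Paris snapshot A"),
    ("Paris", "Paris snapshot B"),
    ("Rome", "Rome snapshot A"),
]
evaluation_corpus = keep_one_per_title(snapshots)  # one entry per title
training_passages = list(snapshots)  # training keeps all snapshots
```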

Thank you for the kind reply!

ReyonRen avatar Sep 06 '21 09:09 ReyonRen

Hi @NThakur20, is it possible to make the preprocessing code from jsonl to TSV available for the NQ dataset? Or if the train.tsv for NQ is available for download, that'd be helpful too.

jaxball avatar Mar 09 '22 18:03 jaxball

The tsv format is here: https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nq-train.zip

Discussed here: https://github.com/beir-cellar/beir/issues/108

mrdrozdov avatar Jun 30 '23 23:06 mrdrozdov
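
For anyone who only has question and passage pairs and wants them in the BEIR-style layout (`corpus.jsonl`, `queries.jsonl`, and a tab-separated qrels file with the header `query-id`, `corpus-id`, `score`), the following is a rough sketch. It is not the maintainers' preprocessing code, and it has not been checked against the contents of `nq-train.zip`. The function name `write_beir_style`, the `q0`/`d0` ID scheme, and the empty `title` field are all assumptions made for illustration.

```python
import csv
import json
import os
from typing import Dict, Iterable, Tuple


def write_beir_style(pairs: Iterable[Tuple[str, str]], out_dir: str) -> None:
    """Write (question, passage) pairs as corpus.jsonl, queries.jsonl, qrels/train.tsv."""
    os.makedirs(os.path.join(out_dir, "qrels"), exist_ok=True)
    passage_ids: Dict[str, str] = {}  # passage text -> assigned corpus ID

    with open(os.path.join(out_dir, "queries.jsonl"), "w", encoding="utf-8") as queries, \
         open(os.path.join(out_dir, "corpus.jsonl"), "w", encoding="utf-8") as corpus, \
         open(os.path.join(out_dir, "qrels", "train.tsv"), "w",
              encoding="utf-8", newline="") as qrels:
        writer = csv.writer(qrels, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])

        for index, (question, passage) in enumerate(pairs):
            query_id = f"q{index}"  # hypothetical ID scheme
            if passage not in passage_ids:
                # Write each distinct passage to the corpus only once.
                passage_ids[passage] = f"d{len(passage_ids)}"
                doc = {"_id": passage_ids[passage], "title": "", "text": passage}
                corpus.write(json.dumps(doc) + "\n")
            queries.write(json.dumps({"_id": query_id, "text": question}) + "\n")
            writer.writerow([query_id, passage_ids[passage], 1])
```

Identical passage texts share one corpus ID here, so several questions can point at the same document in the qrels file.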