Add Wikipedia corpus pre-processing scripts

Open manveertamber opened this issue 2 years ago • 0 comments

The Wikipedia XML dump is first pre-processed with WikiExtractor and the DrQA retriever scripts (https://github.com/facebookresearch/DrQA/tree/main/scripts/retriever). These scripts then perform the final pre-processing step, generating .tsv files for each corpus variant.

Corpus Variants:

  • WIKI_6_3 (passages with segment size of 6 sentences, stride of 3 sentences)
  • WIKI_8_4 (passages with segment size of 8 sentences, stride of 4 sentences)
  • WIKI_100w (passages with disjoint segments of 100 words)
  • WIKI-TL_6_3 (passages with the addition of tables, lists, and infoboxes with segment size of 6 sentences, stride of 3 sentences)
  • WIKI-TL_8_4 (passages with the addition of tables, lists, and infoboxes with segment size of 8 sentences, stride of 4 sentences)
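
To make the segment-size and stride terminology concrete, here is a minimal Python sketch of the two segmentation schemes. It is not the code from the scripts in this issue: the function names `sentence_windows` and `word_chunks` are hypothetical, sentences are assumed to be already split, and the handling of a short trailing window (kept rather than dropped) is an assumption.

```python
from typing import List


def sentence_windows(sentences: List[str], size: int, stride: int) -> List[str]:
    """Join overlapping windows of `size` sentences, advancing by `stride`."""
    passages = []
    for start in range(0, len(sentences), stride):
        passages.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # this window already reaches the end of the document
    return passages


def word_chunks(text: str, size: int = 100) -> List[str]:
    """Split text into disjoint chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


if __name__ == "__main__":
    # Ten placeholder "sentences" s0..s9, segmented as in the *_6_3 variants.
    sents = [f"s{i}" for i in range(10)]
    for passage in sentence_windows(sents, size=6, stride=3):
        print(passage)
```

With a segment size of 6 and a stride of 3, consecutive passages share three sentences, whereas the 100-word chunks do not overlap at all.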

manveertamber, Oct 16 '22 18:10