Add Wikipedia corpus pre-processing scripts
After the Wikipedia XML dump has been pre-processed with WikiExtractor and the DrQA retriever scripts (https://github.com/facebookresearch/DrQA/tree/main/scripts/retriever), these scripts perform the final pre-processing step, generating a .tsv file for each corpus variant listed below (a sketch of the segmentation logic follows the list).
Corpus Variants:
- WIKI_6_3 (passages with segment size of 6 sentences, stride of 3 sentences)
- WIKI_8_4 (passages with segment size of 8 sentences, stride of 4 sentences)
- WIKI_100w (passages with disjoint segments of 100 words)
- WIKI-TL_6_3 (passages that also include tables, lists, and infoboxes, with segment size of 6 sentences, stride of 3 sentences)
- WIKI-TL_8_4 (passages that also include tables, lists, and infoboxes, with segment size of 8 sentences, stride of 4 sentences)
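
As a rough illustration of the segmentation behind these variants, here is a minimal sketch of sliding-window passages (WIKI_6_3 / WIKI_8_4 style) and disjoint 100-word passages (WIKI_100w style) written out as .tsv rows. The function names, the toy input, and the `id`/`text`/`title` column layout are assumptions for illustration, not the exact code in this PR:

```python
import csv
from typing import Iterable, Iterator, List, Tuple


def sliding_window_segments(sentences: List[str], size: int, stride: int) -> Iterator[str]:
    """Yield overlapping passages of `size` sentences, advancing by `stride` sentences."""
    for start in range(0, len(sentences), stride):
        window = sentences[start:start + size]
        if not window:
            break
        yield " ".join(window)
        if start + size >= len(sentences):  # last window already covers the end of the article
            break


def disjoint_word_segments(text: str, size: int = 100) -> Iterator[str]:
    """Yield non-overlapping passages of roughly `size` words (WIKI_100w style)."""
    words = text.split()
    for start in range(0, len(words), size):
        yield " ".join(words[start:start + size])


def write_tsv(path: str, articles: Iterable[Tuple[str, List[str]]], size: int, stride: int) -> None:
    """Write one passage per row as: id <tab> text <tab> title (assumed DPR-style layout)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["id", "text", "title"])
        pid = 1
        for title, sentences in articles:
            for passage in sliding_window_segments(sentences, size, stride):
                writer.writerow([pid, passage, title])
                pid += 1


if __name__ == "__main__":
    # Toy input of (title, sentence list) pairs; in practice this would come from
    # the WikiExtractor / DrQA output described above.
    articles = [("Anarchism", [f"Sentence {i}." for i in range(1, 11)])]
    write_tsv("wiki_6_3.tsv", articles, size=6, stride=3)  # WIKI_6_3 variant
```

With `size=8, stride=4` the same sketch corresponds to the WIKI_8_4 variant; the WIKI-TL variants would additionally keep table, list, and infobox text in the article body before segmentation.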