pygaggle
Document segmentation
Before segmenting each document into passages by applying a sliding window of 10 sentences with a stride of five, is there some extra pre-processing step, perhaps using regular expressions? We simply used the NLTK package to split sentences, and the obtained passages are different from the ones in the released index. In some cases a semicolon seems to be used as a split point, and in other cases sentences with a high ratio of numbers appear to have been removed, but there is a lot more that I feel I haven't taken into account.
Would you mind sharing the data-processing script? Many thanks!
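For reference, the sliding-window segmentation described above (10 sentences, stride of five) can be sketched in plain Python. The function name and parameters below are my own, not from the repo's script; it assumes the sentences have already been split:

```python
def segment(sentences, window=10, stride=5):
    """Group pre-split sentences into overlapping passages.

    Illustrative sketch only: a window of `window` sentences slides
    forward by `stride` sentences at a time, stopping once the window
    reaches the end of the document.
    """
    passages = []
    for start in range(0, len(sentences), stride):
        chunk = sentences[start:start + window]
        passages.append(" ".join(chunk))
        if start + window >= len(sentences):
            break  # the window already covers the tail of the document
    return passages
```

With this scheme every sentence (except near the edges) appears in two passages, since the stride is half the window size.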
Hi @yixuan-qiao
the obtained passages are different from the ones in the released index.
which index are you looking at?
The index we use is msmarco-doc-per-passage; the command is searcher = SimpleSearcher.from_prebuilt_index('msmarco-doc-per-passage')
The top one is ours, and the bottom one is extracted directly from the index
see this repo: https://github.com/castorini/docTTTTTquery
In comparison with per-passage expansion, we will use per passage without expansion as the baseline. In this method, we will not append the predicted queries to the passages.
in the docTTTTTquery repo
Basically we use the spaCy sentencizer; the spaCy version should be 2.1.6 IIRC
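This likely explains the mismatch: rule-based sentencizers split on punctuation heuristics that differ from NLTK's trained Punkt model. As a rough illustration of the rule-based style (this is NOT spaCy's actual logic, just a regex approximation I wrote for this example):

```python
import re

def naive_sentencize(text):
    # Rough stand-in for a rule-based sentencizer: split after
    # ., !, or ? when followed by whitespace and an uppercase letter.
    # Real sentencizers handle abbreviations, quotes, etc.; this does not.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p.strip() for p in parts if p.strip()]
```

Even small differences in such rules (e.g. whether a semicolon or a lowercase continuation triggers a split) will produce different passage boundaries, which is why matching the exact spaCy version matters.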
I found the data-processing script and will try it immediately. Awesome memory, thanks!
After carefully reading the script convert_msmarco_doc_to_t5_format.py, I found a constant of 10000: 10,000 characters, not tokens, which is small relative to the length of the documents (median: 584, max: 333757). Is this for time efficiency?
for doc_id, (doc_title, doc_text) in tqdm(corpus.items(), total=len(corpus)):
    doc = nlp(doc_text[:10000])  # only the first 10,000 characters are kept
    sentences = [sent.string.strip() for sent in doc.sents]
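To make the effect of that cap concrete, here is a quick pure-Python check (not the original script, and no spaCy needed) with a synthetic long document. MAX_CHARS mirrors the constant above:

```python
MAX_CHARS = 10000  # the constant from the script: characters, not tokens

# Build a synthetic document much longer than the cap.
doc_text = " ".join(f"Sentence number {i}." for i in range(5000))

truncated = doc_text[:MAX_CHARS]
print(len(doc_text), len(truncated))
# Everything past MAX_CHARS never reaches the sentencizer, so no
# passages are generated from the tail of a long document; the final
# kept "sentence" may also be cut off mid-word.
```

For a document near the reported maximum length (333,757 characters), roughly 97% of the text would be dropped before segmentation, so this choice trades coverage of long documents for processing speed.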