
Document segmentation

Open · yixuan-qiao opened this issue 3 years ago · 6 comments

Before segmenting each document into passages with a sliding window of 10 sentences and a stride of five, is there some extra pre-processing step, e.g. with regular expressions? We simply used the NLTK package to split sentences, and the passages we obtain are different from the ones in the released index. In some cases a semicolon is used as a split point, and in other cases sentences with a high ratio of digits appear to have been removed, but I suspect there is a lot more that I haven't taken into account.

Would you mind sharing the data-processing script? Many thanks!
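For concreteness, the windowing step I mean can be sketched like this (a minimal sketch; the `sliding_window` helper name is mine, and it assumes the sentences have already been split, e.g. by NLTK's `sent_tokenize`):

```python
def sliding_window(sentences, size=10, stride=5):
    """Group pre-split sentences into overlapping passages:
    a window of `size` sentences, advanced by `stride` each step."""
    passages = []
    for start in range(0, max(len(sentences), 1), stride):
        passages.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):  # last window reached the end
            break
    return passages

# 12 toy sentences -> two overlapping passages (sentences 0-9 and 5-11)
sents = [f"Sentence {i}." for i in range(12)]
print(len(sliding_window(sents)))  # 2
```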

yixuan-qiao avatar Jul 06 '21 03:07 yixuan-qiao

Hi @yixuan-qiao

> the obtained passages are different from the ones in the released index.

which index are you looking at?

MXueguang avatar Jul 06 '21 03:07 MXueguang

The index we use is `msmarco-doc-per-passage`; the command is `searcher = SimpleSearcher.from_prebuilt_index('msmarco-doc-per-passage')`.

[Screenshot comparing the two passage splits: the top one is ours, the bottom one is extracted directly from the index.]

yixuan-qiao avatar Jul 06 '21 03:07 yixuan-qiao

See this repo: https://github.com/castorini/docTTTTTquery, which says:

> In comparison with per-passage expansion, we will use per passage without expansion as the baseline. In this method, we will not append the predicted queries to the passages.

MXueguang avatar Jul 06 '21 04:07 MXueguang

Basically we use the spaCy sentencizer; the spaCy version should be 2.1.6, IIRC.
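A minimal sketch of splitting with the spaCy sentencizer (the try/except dispatch is mine, to cover both the 2.x and 3.x `add_pipe` APIs; the thread's actual processing used spaCy 2.1.6):

```python
import spacy

nlp = spacy.blank("en")
try:
    nlp.add_pipe("sentencizer")                    # spaCy 3.x API
except ValueError:
    nlp.add_pipe(nlp.create_pipe("sentencizer"))   # spaCy 2.x API

doc = nlp("First sentence. Second sentence. Third one.")
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)
```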

MXueguang avatar Jul 06 '21 04:07 MXueguang

I found the data-processing script and will try it right away. Awesome memory, thanks!

yixuan-qiao avatar Jul 06 '21 04:07 yixuan-qiao

Carefully reading the script `convert_msmarco_doc_to_t5_format.py`, I found a constant of 10,000: that's 10,000 characters, not tokens, which is small relative to document length (median: 584, max: 333,757). Maybe this is for time efficiency?

```python
for doc_id, (doc_title, doc_text) in tqdm(corpus.items(), total=len(corpus)):
    doc = nlp(doc_text[:10000])
    # `sent.string` exists in spaCy 2.x; in spaCy 3+ use `sent.text` instead.
    sentences = [sent.string.strip() for sent in doc.sents]
```
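The practical effect of that cap is easy to check; everything past 10,000 characters is dropped before sentence splitting even begins (a sketch; `truncate_doc` is just an illustrative name, not from the script):

```python
MAX_CHARS = 10000  # the constant from convert_msmarco_doc_to_t5_format.py

def truncate_doc(doc_text, max_chars=MAX_CHARS):
    # Character-level cap applied before any sentence splitting,
    # so a long document's tail never reaches the segmenter.
    return doc_text[:max_chars]

long_doc = "All work and no play makes Jack a dull boy. " * 1000  # 44,000 chars
print(len(truncate_doc(long_doc)))  # 10000
```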

yixuan-qiao avatar Jul 06 '21 10:07 yixuan-qiao