
Build inverted indexes on the fly

Open alexlimh opened this issue 3 years ago • 4 comments

Currently pyserini only supports building indexes over document collections stored on disk, but it would be nice to also support building indexes on the fly, e.g., using GPT-2 to generate documents and adding them to the index as a stream.

alexlimh avatar Jun 02 '21 16:06 alexlimh

Agreed. This would be a nice feature. In the meantime, the janky solution is to write the collection to disk and then invoke the indexer via shell, see: https://github.com/castorini/pyserini/blob/master/scripts/msmarco-doc/rerank_with_bm25_passages.py
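A minimal sketch of that workaround, not taken from the linked script: it assumes a recent Pyserini release where the indexer is exposed as `python -m pyserini.index.lucene` with double-dash flags (older releases used a different entry point and flag style), and the helper names `write_collection`, `build_index_command`, and `index_documents` are hypothetical. Documents can come from any iterable, including a generator that produces them one at a time.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import Dict, Iterable, List


def write_collection(docs: Iterable[Dict[str, str]], collection_dir: str) -> Path:
    """Write docs as JSON lines using the {"id", "contents"} fields that JsonCollection reads."""
    out_dir = Path(collection_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "docs.jsonl"  # hypothetical file name
    with out_file.open("w", encoding="utf-8") as f:
        for doc in docs:  # consumes the iterable lazily, one document at a time
            f.write(json.dumps({"id": doc["id"], "contents": doc["contents"]}) + "\n")
    return out_file


def build_index_command(collection_dir: str, index_dir: str, threads: int = 1) -> List[str]:
    """Build the indexer command line (assumed entry point: pyserini.index.lucene)."""
    return [
        sys.executable, "-m", "pyserini.index.lucene",
        "--collection", "JsonCollection",
        "--input", collection_dir,
        "--index", index_dir,
        "--generator", "DefaultLuceneDocumentGenerator",
        "--threads", str(threads),
        "--storePositions", "--storeDocvectors", "--storeRaw",
    ]


def index_documents(docs: Iterable[Dict[str, str]], collection_dir: str, index_dir: str) -> None:
    """Write the collection to disk, then invoke the indexer in a subprocess."""
    write_collection(docs, collection_dir)
    subprocess.run(build_index_command(collection_dir, index_dir), check=True)
```

Calling `index_documents(generated_docs, "collection/", "indexes/generated")` would then need Pyserini and a working Java installation; the first two helpers run without either.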

lintool avatar Jun 02 '21 16:06 lintool

Hi! I was also wondering how to build an index on the fly and came across this. However, the link seems to be broken. Is there any plan to integrate this functionality in the near future? Thanks!

velocityCavalry avatar Jul 25 '22 01:07 velocityCavalry

Bumping this issue: @ola13 brought up a use case for this, namely directly indexing a Hugging Face dataset without first writing out JSON lines...

Maybe we should raise the priority of this...

lintool avatar Oct 19 '22 00:10 lintool

Ref: https://github.com/castorini/anserini/pull/2016

lintool avatar Nov 11 '22 15:11 lintool