pyserini
Build inverted indexes on the fly
Currently pyserini only supports building indexes over document collections stored on disk, but it would be nice to also support building indexes on the fly, e.g., using GPT-2 to generate documents and adding them to the index as a stream.
Agreed. This would be a nice feature. In the meantime, the janky solution is to write the collection to disk and then invoke the indexer via shell, see: https://github.com/castorini/pyserini/blob/master/scripts/msmarco-doc/rerank_with_bm25_passages.py
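For anyone landing here, a minimal sketch of that workaround: dump the generated documents to a temporary JSON Lines file, then shell out to the indexer. The helper names (`write_jsonl`, `build_index_command`, `index_stream`) are hypothetical, and the `pyserini.index.lucene` entry point with these flags is assumed from recent pyserini releases; older versions may expose the indexer under a different module, so check your installed version.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def write_jsonl(docs, collection_dir):
    """Write an iterable of (docid, text) pairs to one JSON Lines file.

    Each line uses the {"id": ..., "contents": ...} layout that the
    JsonCollection reader expects.
    """
    collection_dir = Path(collection_dir)
    collection_dir.mkdir(parents=True, exist_ok=True)
    path = collection_dir / "docs.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for docid, text in docs:
            f.write(json.dumps({"id": str(docid), "contents": text}) + "\n")
    return path


def build_index_command(collection_dir, index_dir, threads=1):
    """Build the indexer command line (assumed entry point and flags)."""
    return [
        sys.executable, "-m", "pyserini.index.lucene",
        "--collection", "JsonCollection",
        "--input", str(collection_dir),
        "--index", str(index_dir),
        "--generator", "DefaultLuceneDocumentGenerator",
        "--threads", str(threads),
        "--storePositions", "--storeDocvectors", "--storeRaw",
    ]


def index_stream(docs, index_dir, run=True):
    """Spill a document stream to a temp dir, then index it via subprocess.

    With run=False the command is only returned, which is handy for
    inspecting it without pyserini or Java installed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        write_jsonl(docs, tmp)
        cmd = build_index_command(tmp, index_dir)
        if run:
            subprocess.run(cmd, check=True)
    return cmd
```

Usage would look something like `index_stream(enumerate(generated_texts), "indexes/generated")`, where `generated_texts` is whatever iterable your generator produces. The documents still touch disk, so this is not true on-the-fly indexing, just the stopgap described above.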
Hi! I was also wondering how to build an index on the fly and came across this issue. However, the link seems to be broken. Is there any plan to integrate this functionality in the near future? Thanks!
Bumping this issue: @ola13 brought up a use case for this, namely directly indexing a Hugging Face (hgf) dataset without first writing out JSON lines...
Maybe we should increase its priority...
Ref: https://github.com/castorini/anserini/pull/2016