pyserini
RM3 and batch search
Hi, I was trying to run an experiment with retrieval based on interpolation of the query model and the relevance model (so-called 'RM3'), and I've run into an error I don't understand.
I use JDK 11 (from Oracle) and pyserini==0.13.0.
- I build an index like this
python -m pyserini.index -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator \
-threads 48 \
-input ... \
-index ... \
-storePositions -storeDocvectors -storeRaw
- I initialize a SimpleSearcher, and turn on RM3
searcher = SimpleSearcher(config['index_path'])
searcher.set_rm3(10, 10, 0.5)
- I run the batch search and obtain the error:
qids = [f"{i}" for i in range(len(queries))]
hits = searcher.batch_search(queries, qids=qids, k=K_extract, threads=threads)
This raises:
Exception in thread "pool-X-thread-Y" java.lang.NullPointerException: Cannot invoke "String.length()" because "s" is null
at java.base/java.io.StringReader.<init>(StringReader.java:51)
at io.anserini.analysis.AnalyzerUtils.analyze(AnalyzerUtils.java:39)
at io.anserini.rerank.lib.Rm3Reranker.rerank(Rm3Reranker.java:80)
at io.anserini.rerank.RerankerCascade.run(RerankerCascade.java:64)
at io.anserini.search.SimpleSearcher.search(SimpleSearcher.java:594)
at io.anserini.search.SimpleSearcher.search(SimpleSearcher.java:573)
at io.anserini.search.SimpleSearcher.lambda$batchSearchFields$0(SimpleSearcher.java:486)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
- Running the retrieval without batch_search succeeds, but it is significantly slower :(
hits = {str(i): searcher.search(q, k=K_extract) for i, q in enumerate(queries)}
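For reference, the sequential workaround above can be wrapped so that it returns the same qid-to-hits mapping I expect from batch_search. This is only a minimal sketch: sequential_batch_search is a hypothetical helper, not part of Pyserini, and it assumes nothing about the searcher beyond a search(query, k=...) method.

```python
from typing import Any, Dict, List, Sequence


def sequential_batch_search(searcher: Any,
                            queries: Sequence[str],
                            qids: Sequence[str],
                            k: int = 10) -> Dict[str, List[Any]]:
    """Hypothetical helper (not part of Pyserini).

    Calls searcher.search() once per query on the calling thread, so the
    multi-threaded batch path is never used. Returns a {qid: hits} dict.
    """
    if len(queries) != len(qids):
        raise ValueError("queries and qids must have the same length")

    results: Dict[str, List[Any]] = {}
    for qid, query in zip(qids, queries):
        # One blocking call per query; slower, but avoids the thread pool.
        results[qid] = searcher.search(query, k=k)
    return results
```

With the setup above, the call would be hits = sequential_batch_search(searcher, queries, qids, k=K_extract). It trades speed for avoiding the failing code path, so it is a stopgap rather than a fix.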
I don't think I am doing anything wrong here, or am I?
Cheers, Martin
@MFajcik thanks for filing this!
It seems like a bug, likely because AnalyzerUtils.analyze is not thread safe? We'll look into it, but can't guarantee when we'll get to it...
Hi @MFajcik - I'm working on a potential fix here: https://github.com/castorini/anserini/pull/1992
Merged into Anserini main trunk. Should work in Pyserini now - TODO: add test case.