RM3 and batch search

Open MFajcik opened this issue 3 years ago • 3 comments

Hi, I was trying to run a retrieval experiment that interpolates the query model with the relevance model (so-called 'RM3'), and I've run into an error I don't understand.

I use JDK 11 (from Oracle) and pyserini==0.13.0.

  1. I build an index like this:
python -m pyserini.index -collection JsonCollection \
                         -generator DefaultLuceneDocumentGenerator \
                         -threads 48 \
                         -input ... \
                         -index ... \
                         -storePositions -storeDocvectors -storeRaw
  2. I initialize a SimpleSearcher and turn on RM3:
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher(config['index_path'])
searcher.set_rm3(10, 10, 0.5)
  3. I run the batch search:
qids = [str(i) for i in range(len(queries))]
hits = searcher.batch_search(queries, qids=qids, k=K_extract, threads=threads)

This causes the following error:

Exception in thread "pool-X-thread-Y" java.lang.NullPointerException: Cannot invoke "String.length()" because "s" is null
	at java.base/java.io.StringReader.<init>(StringReader.java:51)
	at io.anserini.analysis.AnalyzerUtils.analyze(AnalyzerUtils.java:39)
	at io.anserini.rerank.lib.Rm3Reranker.rerank(Rm3Reranker.java:80)
	at io.anserini.rerank.RerankerCascade.run(RerankerCascade.java:64)
	at io.anserini.search.SimpleSearcher.search(SimpleSearcher.java:594)
	at io.anserini.search.SimpleSearcher.search(SimpleSearcher.java:573)
	at io.anserini.search.SimpleSearcher.lambda$batchSearchFields$0(SimpleSearcher.java:486)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
  4. Running the retrieval without batch_search seems to succeed, but it is significantly slower:
hits = {str(i): searcher.search(q, k=K_extract) for i, q in enumerate(queries)}

I don't think I'm making a mistake here, or am I?

Cheers, Martin

MFajcik, Oct 21 '21 16:10
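
For readers unfamiliar with the interpolation mentioned in the post, here is a minimal sketch of the idea. It assumes that the three arguments to set_rm3(10, 10, 0.5) are the number of feedback terms, the number of feedback documents, and the weight given to the original query. The function name rm3_interpolate and the toy term distributions are hypothetical. This is not Anserini's implementation, which also estimates the relevance model from the feedback documents and prunes it to the top feedback terms.

```python
def rm3_interpolate(query_model, feedback_model, original_query_weight=0.5):
    """Linearly interpolate two term distributions, RM3-style.

    query_model and feedback_model map terms to probabilities.
    original_query_weight is the share given to the original query;
    the remainder goes to the relevance (feedback) model.
    """
    terms = set(query_model) | set(feedback_model)
    return {
        term: original_query_weight * query_model.get(term, 0.0)
        + (1.0 - original_query_weight) * feedback_model.get(term, 0.0)
        for term in terms
    }


# Toy distributions (hypothetical values, for illustration only).
query_model = {"rm3": 0.5, "search": 0.5}
feedback_model = {"search": 0.5, "batch": 0.5}

expanded = rm3_interpolate(query_model, feedback_model, original_query_weight=0.5)
for term, weight in sorted(expanded.items()):
    print(term, weight)
```

With a weight of 0.5, terms that appear in only one of the two models keep half of their probability, and terms that appear in both receive contributions from each.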

@MFajcik thanks for filing this!

This looks like a bug, likely because AnalyzerUtils.analyze is not thread-safe? We'll look into it, but can't guarantee when we'll get to it...

lintool, Oct 21 '21 20:10

Hi @MFajcik - I'm working on a potential fix here: https://github.com/castorini/anserini/pull/1992

lintool, Oct 14 '22 12:10

Merged into the Anserini main trunk. This should work in Pyserini now. TODO: add a test case.

lintool, Oct 17 '22 23:10
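
One possible shape for the test case mentioned above is a check that batch_search and per-query search return the same ranked document IDs. The sketch below is only an illustration under stated assumptions: the helper batch_matches_sequential is hypothetical, and FakeSearcher is a stand-in so the snippet runs without a JVM or an index. It assumes hits expose a docid attribute and that batch_search returns a dict keyed by query ID. In a real test, a SimpleSearcher with RM3 enabled would take the place of FakeSearcher.

```python
from collections import namedtuple

# Stand-in for a search hit; only the fields used below are modeled.
Hit = namedtuple("Hit", ["docid", "score"])


class FakeSearcher:
    """Hypothetical stand-in that mimics the two search methods used in this thread."""

    def search(self, query, k=10):
        return [Hit(f"{query}-doc{i}", 1.0 / (i + 1)) for i in range(k)]

    def batch_search(self, queries, qids, k=10, threads=1):
        return {qid: self.search(q, k=k) for qid, q in zip(qids, queries)}


def batch_matches_sequential(searcher, queries, k=10, threads=2):
    """Return True if batch and per-query search agree on ranked docids."""
    qids = [str(i) for i in range(len(queries))]
    batch_hits = searcher.batch_search(queries, qids=qids, k=k, threads=threads)
    for qid, query in zip(qids, queries):
        expected = [hit.docid for hit in searcher.search(query, k=k)]
        actual = [hit.docid for hit in batch_hits[qid]]
        if expected != actual:
            return False
    return True


print(batch_matches_sequential(FakeSearcher(), ["rm3", "batch search"], k=5))
```

Comparing document IDs rather than scores keeps the check robust to small floating-point differences between the two code paths.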