pyserini
Add multi-thread support for scripts/dpr/convert_trec_run_to_retrieval_json.py
For https://github.com/castorini/pyserini/issues/370
Controlling for everything else, using searcher.batch_doc seems to be slower than using searcher.doc. That is to say, I have found that using more than one thread makes this script run slower. However, once more than one thread is in use, increasing the number of threads beyond 2 does seem to help.
@manveertamber can you run some concrete performance measurements? E.g.,
- previous impl
- new impl with 1, 2, 4, 8, ... threads
Non-linear scaling is to be expected, I think, since there is inevitably contention for the underlying data structures...
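A measurement like the one requested above could be structured roughly as follows. This is only a sketch, not the script from this PR: `fetch_doc` is a hypothetical stand-in for a per-document lookup (for example `searcher.doc(docid)`), and `fetch_all` and `benchmark` are illustrative names. The point is to time the same workload at each thread count so the numbers are directly comparable.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_doc(docid):
    # Hypothetical stand-in for a per-document lookup such as searcher.doc(docid).
    return {"id": docid, "contents": f"text of {docid}"}


def fetch_all(docids, threads):
    """Fetch every docid: serially for one thread, otherwise via a thread pool."""
    if threads == 1:
        return [fetch_doc(d) for d in docids]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch_doc, docids))  # map() preserves input order


def benchmark(docids, thread_counts=(1, 2, 4, 8)):
    """Return {threads: elapsed_seconds} for the same workload at each setting."""
    timings = {}
    for n in thread_counts:
        start = time.perf_counter()
        docs = fetch_all(docids, n)
        timings[n] = time.perf_counter() - start
        assert len(docs) == len(docids)  # sanity check: nothing was dropped
    return timings


if __name__ == "__main__":
    ids = [f"doc{i}" for i in range(10_000)]
    for n, secs in benchmark(ids).items():
        print(f"{n} thread(s): {secs:.3f}s")
```

With a trivial in-memory `fetch_doc` like this one, the threaded runs will mostly show pool overhead; swapping in the real lookup is what makes the comparison meaningful.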
@lintool Running this on orca:
nohup python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \
--index wikipedia-dpr \
--topics dpr-nq-test \
--input runs/run.dpr.nq-test.bm25.trec \
--output runs/run.dpr.nq-test.bm25.json \
- previous impl: 19:13 (mm:ss)
- new impl:
  - 1 thread (uses searcher.doc): 19:16
  - 2 threads: 52:11
  - 4 threads: 43:54
  - 8 threads: 37:42
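To make the comparison easier to read, the timings reported above can be converted to seconds and expressed as a slowdown relative to the previous implementation. The helper names here (`to_seconds`, `slowdowns`) are illustrative only; the numbers are the ones from this run.

```python
def to_seconds(mmss):
    """Convert a 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Timings reported in this thread (mm:ss).
BASELINE = to_seconds("19:13")  # previous impl
NEW_IMPL = {1: "19:16", 2: "52:11", 4: "43:54", 8: "37:42"}  # threads -> time


def slowdowns(baseline, runs):
    """Return {threads: elapsed / baseline}, rounded to two decimals."""
    return {n: round(to_seconds(t) / baseline, 2) for n, t in runs.items()}


if __name__ == "__main__":
    for n, ratio in slowdowns(BASELINE, NEW_IMPL).items():
        print(f"{n} thread(s): {ratio:.2f}x the previous impl")
```

On these numbers, the single-threaded path is on par with the previous implementation, while 2, 4, and 8 threads take roughly 2.7x, 2.3x, and 2.0x as long.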
Ack. Let's put this issue on pause.
@HAKSOAT has confirmed that the slowdown happens on the Java end also. We're trying to diagnose why this is the case.