Add multi-thread support for scripts/dpr/convert_trec_run_to_retrieval_json.py

Open manveertamber opened this issue 2 years ago • 3 comments

For https://github.com/castorini/pyserini/issues/370

Controlling for everything else, using searcher.batch_doc seems to be slower than using searcher.doc. That is, running this script with more than one thread makes it slower than running it with a single thread. Increasing the number of threads beyond 2 does seem to help, though.

manveertamber, May 18 '22 15:05

@manveertamber can you run some concrete performance measurements? E.g.,

  • previous impl
  • new impl with 1, 2, 4, 8, ... threads

non-linear scaling is to be expected, I think, since there is inevitably contention for the underlying data structures...
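
A minimal sketch of how such measurements might be collected. The helper `time_fetch` and the stand-in `fake_fetch` are hypothetical names used only for illustration; they are not part of Pyserini. The sketch assumes the fetch function takes a list of docids and a thread count.

```python
import time
from typing import Callable, Dict, Iterable, List


def time_fetch(
    fetch: Callable[[List[str], int], object],
    docids: List[str],
    thread_counts: Iterable[int],
) -> Dict[int, float]:
    """Return wall-clock seconds taken by fetch(docids, n) for each thread count n."""
    results: Dict[int, float] = {}
    for n in thread_counts:
        start = time.perf_counter()
        fetch(docids, n)
        results[n] = time.perf_counter() - start
    return results


def fake_fetch(docids: List[str], threads: int) -> Dict[str, str]:
    """Stand-in for a real document fetcher (hypothetical, does no I/O)."""
    return {docid: f"contents of {docid}" for docid in docids}


if __name__ == "__main__":
    timings = time_fetch(fake_fetch, ["doc1", "doc2", "doc3"], [1, 2, 4, 8])
    for n, seconds in timings.items():
        print(f"{n} thread(s): {seconds:.6f} s")
```

With a real searcher, one would pass something like `lambda ids, n: searcher.batch_doc(ids, n)` in place of `fake_fetch`, assuming `batch_doc` accepts a docid list and a thread count. Each configuration should be run against the same docids, ideally more than once, so that caching effects do not dominate the comparison.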

lintool, May 18 '22 16:05

@lintool Running this on orca:

nohup python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \
    --index wikipedia-dpr \
    --topics dpr-nq-test \
    --input runs/run.dpr.nq-test.bm25.trec \
    --output runs/run.dpr.nq-test.bm25.json \

Timings (mm:ss):

  • previous impl: 19:13
  • new impl, 1 thread (uses searcher.doc): 19:16
  • new impl, 2 threads: 52:11
  • new impl, 4 threads: 43:54
  • new impl, 8 threads: 37:42
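
To make the comparison easier to read, the timings above can be expressed relative to the single-thread run. This is only arithmetic on the reported numbers; the helper name `to_seconds` is introduced here for illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Reported timings for the new implementation, keyed by thread count.
timings = {1: "19:16", 2: "52:11", 4: "43:54", 8: "37:42"}

baseline = to_seconds(timings[1])
slowdown = {threads: to_seconds(t) / baseline for threads, t in timings.items()}

for threads, ratio in slowdown.items():
    print(f"{threads} thread(s): {ratio:.2f}x the single-thread time")
```

By this measure the 2, 4, and 8 thread runs take roughly 2.7x, 2.3x, and 2.0x as long as the single-thread run, so adding threads narrows the gap without closing it.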

manveertamber, May 18 '22 23:05

Ack. Let's put this issue on pause.

@HAKSOAT has confirmed that the slowdown happens on the Java end also. We're trying to diagnose why this is the case.

lintool, May 23 '22 21:05