Jimmy Lin
I see, the issue is that `.doc` has no "batch" version, huh? This will need to be done on the Java side... a bit more involved. We should punt on...
hi @manveertamber I believe you were going to take this on?
More details: https://github.com/castorini/onboarding/blob/master/docs/cc-guide.md
The key is `searcher.doc`. As part of https://github.com/castorini/anserini/issues/1778 we now have a more efficient batch method. So, replace `.doc` with that more efficient method. The result: QA...
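To make the suggested swap concrete, here is a rough sketch of what "replace `.doc` with the batch method" could look like on the Python side. It assumes the batch method is exposed as `batch_doc(docids, threads)`; the comment above does not name it, so treat that name and signature as an assumption and check your Pyserini version. `fetch_docs` and `FakeSearcher` are hypothetical helpers, included only so the sketch runs without an index.

```python
from typing import Any, Dict, List


def fetch_docs(searcher: Any, docids: List[str], threads: int = 4) -> Dict[str, Any]:
    """Fetch many documents, preferring one batch call over repeated .doc() calls."""
    # Assumption: the batch method is called `batch_doc` and takes (docids, threads).
    batch = getattr(searcher, "batch_doc", None)
    if callable(batch):
        return dict(batch(docids, threads))
    # Fallback: the original one-document-at-a-time pattern.
    return {docid: searcher.doc(docid) for docid in docids}


class FakeSearcher:
    """Hypothetical in-memory stand-in for a real searcher."""

    def __init__(self, store: Dict[str, str]) -> None:
        self.store = store
        self.calls = 0  # counts single-document lookups

    def doc(self, docid: str) -> str:
        self.calls += 1
        return self.store[docid]


searcher = FakeSearcher({"d1": "first passage", "d2": "second passage"})
docs = fetch_docs(searcher, ["d1", "d2"])
```

With the stand-in, `fetch_docs` falls back to two `.doc` calls; with a searcher that provides the batch method, the loop collapses into a single call.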
Also, do we have something like `QueryEncoder.from_huggingface('model')`?
Link to ACL Anthology version: https://aclanthology.org/2021.emnlp-main.227/
Agreed. This would be a nice feature. In the meantime, the janky solution is to write the collection to disk and then invoke the indexer via shell; see: https://github.com/castorini/pyserini/blob/master/scripts/msmarco-doc/rerank_with_bm25_passages.py
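For reference, the workaround amounts to roughly the following. This is a minimal sketch, not a copy of the linked script: `write_jsonl` and `indexer_command` are hypothetical helpers, and the indexer flags are the ones I recall from Pyserini's indexing docs, so verify them against your installed version. The actual shell invocation is left commented out because it needs Pyserini and Java.

```python
import json
import sys
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_jsonl(docs: Iterable[Dict[str, str]], out_dir: Path) -> Path:
    """Write documents as JSON lines with `id` and `contents` fields."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "docs.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps({"id": doc["id"], "contents": doc["contents"]}) + "\n")
    return path


def indexer_command(input_dir: Path, index_dir: Path, threads: int = 1) -> List[str]:
    """Build the indexer command line (flags assumed from Pyserini's docs)."""
    return [
        sys.executable, "-m", "pyserini.index.lucene",
        "--collection", "JsonCollection",
        "--input", str(input_dir),
        "--index", str(index_dir),
        "--generator", "DefaultLuceneDocumentGenerator",
        "--threads", str(threads),
        "--storeRaw",
    ]


work = Path(tempfile.mkdtemp())
jsonl_path = write_jsonl(
    [{"id": "d1", "contents": "hello world"}, {"id": "d2", "contents": "goodbye"}],
    work / "collection",
)
cmd = indexer_command(work / "collection", work / "index")
# import subprocess; subprocess.run(cmd, check=True)  # needs Pyserini + Java
```

The round trip through disk is exactly the step that direct indexing from Python would remove.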
Bumping this issue: @ola13 brought up a use case for this, directly indexing a Hugging Face dataset without first writing out JSON lines... maybe we should increase in terms...
@manveertamber can you run some concrete performance measurements? E.g.,

+ previous impl
+ new impl with 1, 2, 4, 8, ... threads

non-linear scaling is to be expected, I think,...
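A measurement along these lines could be scripted roughly as follows. This is only a sketch of the harness: `fake_fetch` is a hypothetical stand-in workload (a short sleep), and Python's `ThreadPoolExecutor` stands in for whatever threading the real implementation uses, so the numbers say nothing about the actual Java code. The previous implementation would be timed the same way and used as the baseline.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def time_with_threads(
    task: Callable[[str], str],
    items: Sequence[str],
    thread_counts: Sequence[int] = (1, 2, 4, 8),
) -> Dict[int, float]:
    """Return wall-clock seconds for running `task` over `items` per thread count."""
    timings: Dict[int, float] = {}
    for n in thread_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(task, items))  # force all tasks to complete
        timings[n] = time.perf_counter() - start
    return timings


def fake_fetch(docid: str) -> str:
    """Hypothetical workload: pretend each lookup takes about a millisecond."""
    time.sleep(0.001)
    return docid


timings = time_with_threads(fake_fetch, [str(i) for i in range(64)])
baseline = timings[1]
for n, secs in timings.items():
    print(f"{n} threads: {secs:.4f}s, speedup {baseline / secs:.2f}x")
```

Reporting speedup relative to the single-thread run makes any non-linear scaling easy to spot.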
Ack. Let's put this issue on pause. @HAKSOAT has confirmed that the slowdown happens on the Java end also. We're trying to diagnose why this is the case.
Unfortunately, adding this feature is a bit more involved... due to the jankiness of multi-processing, an efficient parallel implementation needs to be done on the Java side, and then with...