Jimmy Lin comments

Results: 263 comments of Jimmy Lin

Multi-thread scripts/dpr/convert_trec_run_to_retrieval_json.py

I see, the issue is that `.doc` has no "batch" version, huh? This will need to be done on the Java side... a bit more involved. We should punt on...

Multi-thread scripts/dpr/convert_trec_run_to_retrieval_json.py

Hi @manveertamber, I believe you were going to take this on?

Multi-thread scripts/dpr/convert_trec_run_to_retrieval_json.py

More details: https://github.com/castorini/onboarding/blob/master/docs/cc-guide.md The key is `searcher.doc`. As part of https://github.com/castorini/anserini/issues/1778 we now have a more efficient batch method. So, replace `.doc` with that more efficient method. The result: QA...
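A minimal sketch of the change described above: rather than calling `.doc` once per hit, collect the docids across all queries and fetch them in one batch call. `FakeSearcher` and its `batch_doc(docids, threads)` method are hypothetical stand-ins so the example runs without an index; the name and signature of the real batch method are an assumption and should be checked against the Pyserini version in use.

```python
from typing import Dict, List


class FakeSearcher:
    """Hypothetical stand-in for a searcher; not the Pyserini class."""

    def __init__(self, store: Dict[str, str]):
        self.store = store
        self.batch_calls = 0

    def batch_doc(self, docids: List[str], threads: int) -> Dict[str, str]:
        # Assumed signature: one call returns a docid -> document mapping.
        self.batch_calls += 1
        return {d: self.store[d] for d in docids if d in self.store}


def fetch_docs_for_run(searcher, run: Dict[str, List[str]], threads: int = 4) -> Dict[str, str]:
    """Fetch every document referenced in a run with a single batch call."""
    # Deduplicate docids across queries while keeping first-seen order.
    unique_ids = list(dict.fromkeys(d for ids in run.values() for d in ids))
    return searcher.batch_doc(unique_ids, threads)


# Toy run: query id -> ranked docids (hypothetical data).
run = {"q1": ["d1", "d2"], "q2": ["d2", "d3"]}
searcher = FakeSearcher({"d1": "text one", "d2": "text two", "d3": "text three"})
docs = fetch_docs_for_run(searcher, run)
```

Deduplicating first means a document that appears in several queries' hit lists is fetched only once.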

Simplify initialization of prebuilt dense indexes

Also, do we have something like `QueryEncoder.from_huggingface('model')`?

Update DPR compression paper from EMNLP 2021

Link to ACL Anthology version: https://aclanthology.org/2021.emnlp-main.227/

Build inverted indexes on the fly

Agreed. This would be a nice feature. In the meantime, the janky solution is to write collection to disk and then invoke indexer via shell, see: https://github.com/castorini/pyserini/blob/master/scripts/msmarco-doc/rerank_with_bm25_passages.py
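A rough sketch of that workaround, assuming the JSON lines layout with `id` and `contents` fields and the `pyserini.index.lucene` command-line entry point (older releases may use a different module name). The helper names `write_collection` and `indexer_command` are made up for illustration and are not taken from the linked script.

```python
import json
import sys
import tempfile
from pathlib import Path
from typing import Dict, List


def write_collection(docs: Dict[str, str], collection_dir: Path) -> Path:
    """Write docs as JSON lines, one {"id", "contents"} object per line."""
    collection_dir.mkdir(parents=True, exist_ok=True)
    out = collection_dir / "docs.jsonl"
    with out.open("w", encoding="utf-8") as f:
        for docid, text in docs.items():
            f.write(json.dumps({"id": docid, "contents": text}) + "\n")
    return out


def indexer_command(collection_dir: Path, index_dir: Path, threads: int = 1) -> List[str]:
    """Build (but do not run) the shell command that invokes the indexer."""
    return [
        sys.executable, "-m", "pyserini.index.lucene",
        "--collection", "JsonCollection",
        "--input", str(collection_dir),
        "--index", str(index_dir),
        "--generator", "DefaultLuceneDocumentGenerator",
        "--threads", str(threads),
    ]


workdir = Path(tempfile.mkdtemp())
jsonl_path = write_collection(
    {"d1": "first document", "d2": "second document"}, workdir / "collection"
)
cmd = indexer_command(workdir / "collection", workdir / "index")
# To actually build the index (requires Pyserini and Java):
#   import subprocess; subprocess.run(cmd, check=True)
```

The command is only constructed here, not executed, so the sketch runs without Pyserini installed.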

Build inverted indexes on the fly

Bumping this issue. @ola13 brought up a use case for this: directly indexing a Hugging Face dataset without first writing out JSON lines... maybe we should increase in terms...

Add multi-thread support for scripts/dpr/convert_trec_run_to_retrieval_json.py

@manveertamber can you run some concrete performance measurements? E.g.,
+ previous impl
+ new impl with 1, 2, 4, 8, ... threads

non-linear scaling is to be expected, I think,...
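One way such measurements could be collected is with a small timing harness like the one below. It is only a sketch: `simulated_lookup` is a hypothetical stand-in for the real per-document work, and the reported times describe this toy workload, not the actual script.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def simulated_lookup(docid: str) -> str:
    # Hypothetical stand-in for an I/O-bound document fetch.
    time.sleep(0.001)
    return docid.upper()


def time_with_threads(
    fn: Callable[[str], str], items: Sequence[str], thread_counts: Sequence[int]
) -> Dict[int, float]:
    """Return wall-clock seconds taken to map fn over items per thread count."""
    timings: Dict[int, float] = {}
    for n in thread_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            results = list(pool.map(fn, items))
        timings[n] = time.perf_counter() - start
        # Sanity check: every item was processed.
        assert len(results) == len(items)
    return timings


items = [f"doc{i}" for i in range(64)]
timings = time_with_threads(simulated_lookup, items, [1, 2, 4, 8])
for n, seconds in timings.items():
    print(f"{n} thread(s): {seconds:.3f}s")
```

The single-thread row approximates a sequential baseline; timing the previous implementation separately would complete the comparison.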

Add multi-thread support for scripts/dpr/convert_trec_run_to_retrieval_json.py

Ack. Let's put this issue on pause. @HAKSOAT has confirmed that the slowdown happens on the Java end also. We're trying to diagnose why this is the case.

Efficiently compute BM25 scores between a collection of queries and documents

Unfortunately, adding this feature is a bit more involved... due to the jankiness of multi-processing, an efficient parallel implementation needs to be done on the Java side, and then with...