Jimmy Lin
Hi @Timoeller - Thanks for your response. We've been working on building test collections also, but via a slightly different approach: https://arxiv.org/abs/2004.11339 I was wondering if you'd be interested in more...
What's your email? Or you can find mine on my website: https://cs.uwaterloo.ca/~jimmylin/index.html
You can find Anserini regressions for BEIR here: https://github.com/castorini/anserini#regressions-for-beir-v100 You can reuse the tuning script implemented here: https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage.md#bm25-tuning
Unless there are scientifically interesting questions you want to explore, I would advocate just giving up on indexing and letting Anserini/Lucene do it for you via CIFF. Why waste engineering...
Also, we're one issue away from making the entire indexing pipeline in Anserini pip-installable: https://github.com/castorini/pyserini/issues/77 Something like: ``` $ pip install pyserini ... $ python -m pyserini.index --collection...
@wxp16 @richard3983 maybe you'd be interested in taking this on?
The error message seems pretty informative - have you checked the length of your input samples?
Oh nice! For example: https://api.semanticscholar.org/v1/paper/ACL:J07-1005 for https://www.aclweb.org/anthology/J07-1005/
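A minimal sketch of the mapping in that example, going from an ACL Anthology page URL to the corresponding Semantic Scholar API URL. The helper names are hypothetical, and the `ACL:` prefix pattern is assumed only from the single example above; the sketch builds the URL and does not issue any request.

```python
from urllib.parse import urlparse

# Prefix taken from the example URL above; assumed to apply to other ACL IDs too.
API_PREFIX = "https://api.semanticscholar.org/v1/paper/ACL:"


def acl_id_from_anthology_url(url):
    """Return the last path segment of an Anthology URL, e.g. 'J07-1005'."""
    path = urlparse(url).path
    return path.strip("/").split("/")[-1]


def semantic_scholar_api_url(anthology_url):
    """Build the API URL for a paper from its Anthology page URL (hypothetical helper)."""
    return API_PREFIX + acl_id_from_anthology_url(anthology_url)


print(semantic_scholar_api_url("https://www.aclweb.org/anthology/J07-1005/"))
```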
Even better, there's a dump: https://api.semanticscholar.org/corpus/download/
Wow, what an obscure bug! How about we just drop all terms longer than 255 chars? They are unlikely to be meaningful anyway.
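Roughly, the proposed filter could look like the sketch below. It is an illustration in plain Python, not the actual fix in the codebase: `drop_long_terms` and `MAX_TERM_LENGTH` are hypothetical names, and length is measured here in Unicode characters, which is an assumption (a byte-based limit would need `len(term.encode("utf-8"))` instead).

```python
MAX_TERM_LENGTH = 255  # threshold suggested in the comment above


def drop_long_terms(terms, max_length=MAX_TERM_LENGTH):
    """Keep only terms with at most max_length characters (hypothetical helper)."""
    return [term for term in terms if len(term) <= max_length]


tokens = ["anserini", "x" * 300, "lucene", "y" * 255]
kept = drop_long_terms(tokens)
print(len(kept))  # 3: only the 300-character term is dropped
```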