Jimmy Lin issues

Results: 211 issues by Jimmy Lin

Annotation methodology of QA resources

Hi there, thanks for sharing your QA resource: https://github.com/deepset-ai/COVID-QA/tree/master/data/question-answering I was wondering whether you have a write-up of the annotation methodology. For example, how were the documents selected, how were...

question

Low BM25 baselines?

Hi there, thanks for providing this nice resource! Looking at your paper, I think your BM25 baselines are a bit low? You report 0.218 nDCG@10 on MS MARCO, if I'm...

Debug HC4 RM3 effectiveness for Russian

https://github.com/castorini/anserini/blob/master/docs/regressions-hc4-v1.0-ru.md#effectiveness RM3 only gets 0.0821. Possibly a bug? We should look into it...

Run HC4 with Lucene 9

Look at #1875. Based on my preliminary tests, Lucene 9 has better analyzers for Russian. We should try HC4 and see if it makes a difference. @ToluClassics can you try...

batchGetDocument: multi-threaded implementation is slower than single-threaded implementation

We merged #1857 without doing a performance analysis. However, the multi-threaded implementation turns out to be slower... This was originally reported in Pyserini: https://github.com/castorini/pyserini/pull/1178 but @HAKSOAT confirmed...

Better implementation of JsonVectorCollection than the "fake words" approach

To index JsonVectorCollection sparse vectors, we currently use the "fake words" trick - just duplicate the word _X_ times, where _X_ is the score. This might be a better solution:...
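For readers unfamiliar with the "fake words" trick described in this issue, here is a minimal Python sketch of the idea. It is not Anserini's actual code: the function name `to_fake_words` and the sample vector are hypothetical, and the sketch assumes the weights are already non-negative integers (for example, quantized scores).

```python
from collections import Counter
from typing import Dict


def to_fake_words(weights: Dict[str, int]) -> str:
    """Expand an integer-weighted sparse vector into a pseudo-document.

    Each token is repeated `weight` times, so an index that records term
    frequencies ends up storing the weight as the term frequency.
    Assumption: weights are non-negative integers.
    """
    parts = []
    for token, weight in weights.items():
        if not isinstance(weight, int) or weight < 0:
            raise ValueError(f"weight for {token!r} must be a non-negative integer")
        # Duplicate the token `weight` times.
        parts.extend([token] * weight)
    return " ".join(parts)


# Hypothetical sparse vector: token -> integer score.
vector = {"covid": 3, "vaccine": 1, "trial": 2}
pseudo_doc = to_fake_words(vector)

# Counting the tokens in the pseudo-document recovers the original weights.
recovered = dict(Counter(pseudo_doc.split()))
```

The pseudo-document grows linearly with the sum of the weights, which is one reason an alternative representation might be preferable.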

Integrate Waterloo spam scores and other static priors into index

We should develop a generic mechanism to store and use Waterloo spam scores, PageRank, HITS, and other static priors. @iorixxx Do you have some code to contribute along these lines?

JsonVectorCollection: refactor to not depend on MultifieldSourceDocument

Currently, we're extending MultifieldSourceDocument, which probably shouldn't be the case.

Regression for KILT

Add SDM regression

We don't have a regression test for SDM. We should fix this.