
Include enriched sparse lexical retrieval methods

Open joshdevins opened this issue 3 years ago • 10 comments

First, a thank you. The paper and repo have been fantastic resources to help conversations around out-of-domain retrieval!

Second, a feature request. I think it would be very interesting to see some of the document/index enrichment approaches added to the benchmark and paper discussion, as extensions to sparse lexical retrieval. You mention both doc2query and DeepCT/HDCT in the paper but don't provide benchmark data for them. Since they are trained on MS MARCO, it would be interesting to see whether they perform well out-of-domain and in comparison to both BM25+CE and ColBERT, which perform very well out-of-domain.

joshdevins avatar May 31 '21 08:05 joshdevins

Hi @joshdevins,

I also find this feature interesting, and it is already planned to be added to the BEIR repository.

I have started integrating with Pyserini, and we currently have Anserini-BM25 and RM3 expansion added to BEIR. Doc2query would be the next method to add and should be straightforward to integrate.
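For reference, lexical baselines in BEIR follow the same retrieve-then-evaluate loop as the dense models. Below is a minimal sketch using the Elasticsearch-backed BM25Search class; the dataset name, download URL, and host are placeholders, and the Anserini/Pyserini path differs in the search backend but follows the same pattern.

```python
# Minimal sketch of BEIR's retrieve-then-evaluate loop with Elasticsearch BM25.
# Dataset name, URL, and host are placeholders.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25

dataset = "scifact"  # any BEIR dataset name
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Requires a running Elasticsearch instance on localhost:9200.
model = BM25(index_name=dataset, hostname="localhost:9200", initialize=True)
retriever = EvaluateRetrieval(model)

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # e.g. NDCG@k for each k in retriever.k_values
```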

Regarding DeepCT, I would need to have a look at the original repository and check how easily it can be integrated with the BEIR repo. Hopefully, it will not be difficult.

I shall update you once both methods have been added to the BEIR repository.

Kind Regards, Nandan

thakur-nandan avatar May 31 '21 11:05 thakur-nandan

Ok that sounds great @NThakur20. I'm definitely more interested in doc2query as it performs much better in-domain than DeepCT, so even that additional datapoint in the benchmark would be really useful.

joshdevins avatar May 31 '21 12:05 joshdevins

Hey @NThakur20, a colleague pointed me to a new paper that might also be interesting and roughly fits into the sparse lexical retrieval category. It looks like they already have model checkpoints, but indexing and retrieval use custom indices (as far as I can tell).

COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List https://github.com/luyug/COIL
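For anyone skimming, my rough understanding of the COIL idea: it keeps exact lexical match, but each overlapping term is scored with contextualized token vectors. A toy sketch of that token-level scoring is below; the vectors are random stand-ins, and the real system encodes tokens with BERT, adds a dense [CLS] component, and serves scoring from contextualized inverted lists.

```python
# Toy illustration of COIL-style token scoring: exact lexical match, with each
# matching term scored by a max over contextualized doc-token vectors, summed
# over query tokens. Vectors here are random stand-ins, not real encodings.
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy token-vector dimension

query_tokens = ["cheap", "flights", "berlin"]
doc_tokens = ["flights", "to", "berlin", "are", "rarely", "cheap"]

# One stand-in contextualized vector per token occurrence.
q_vecs = {i: rng.normal(size=dim) for i, _ in enumerate(query_tokens)}
d_vecs = {j: rng.normal(size=dim) for j, _ in enumerate(doc_tokens)}

def coil_token_score(query_tokens, q_vecs, doc_tokens, d_vecs):
    score = 0.0
    for i, q_tok in enumerate(query_tokens):
        # Only exact surface-form matches contribute (the "sparse" part).
        matches = [j for j, d_tok in enumerate(doc_tokens) if d_tok == q_tok]
        if matches:
            score += max(float(q_vecs[i] @ d_vecs[j]) for j in matches)
    return score

print(coil_token_score(query_tokens, q_vecs, doc_tokens, d_vecs))
```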

joshdevins avatar Jun 10 '21 09:06 joshdevins

Hi @joshdevins,

Thanks for mentioning the resource. The paper was presented at NAACL, and I had a long chat with the authors about it. We are in talks to integrate COIL into the BEIR repository.

Kind Regards, Nandan

thakur-nandan avatar Jun 12 '21 12:06 thakur-nandan

Hi @joshdevins,

I've added code to evaluate docT5query with the BEIR benchmark. You can find sample code here to run and evaluate it using Pyserini-BM25 - (link).
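For context, a minimal sketch of the docT5query expansion step: generate a few queries per document with the castorini/doc2query-t5-base-msmarco checkpoint (via Hugging Face transformers) and append them to the document text before BM25 indexing. The generation settings below are illustrative, not necessarily what the BEIR example script uses.

```python
# Sketch of docT5query document expansion; settings are illustrative.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.eval()

doc_text = "The Manhattan Project produced the first nuclear weapons during World War II."

input_ids = tokenizer.encode(doc_text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        max_length=64,
        do_sample=True,          # top-k sampling, so queries differ between runs
        top_k=10,
        num_return_sequences=3,  # number of expansion queries per document
    )

queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
expanded_doc = doc_text + " " + " ".join(queries)  # this is what gets indexed with BM25
print(expanded_doc)
```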

Kind Regards, Nandan Thakur

thakur-nandan avatar Jun 12 '21 14:06 thakur-nandan

I'm gonna have a look at this soon. Can I add results here for inclusion in the Google Sheets "leaderboard"?

joshdevins avatar Jun 28 '21 14:06 joshdevins

Hi @joshdevins,

Yes, I haven't had time to run docT5query on all the BEIR datasets; as you mentioned, it takes time. Feel free to share your docT5query results on the BEIR datasets here, and I would be happy to add them to the Google Sheets leaderboard.

Also, I finally have a working example up for DeepCT. The original DeepCT repository is quite old and only works with TensorFlow 1.x; I had to modify it to work with the latest TensorFlow versions. You can find a sneak peek here: https://github.com/UKPLab/beir/blob/development/examples/retrieval/evaluation/sparse/evaluate_deepct.py
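For context, DeepCT predicts a context-aware importance weight for each term in a passage, and a common recipe (as described in the DeepCT paper) is to quantize those weights into pseudo term frequencies, roughly tf = round(weight * N), and index the repeated terms with a standard BM25 engine. The sketch below only illustrates that quantization step with hypothetical weights; it is not DeepCT's actual API.

```python
# Rough sketch: turn per-term importance weights (hypothetical values) into a
# bag of repeated terms so a standard BM25 index can consume them.
from typing import Dict

def weights_to_pseudo_doc(term_weights: Dict[str, float], scale: int = 100) -> str:
    """Quantize term weights into pseudo term frequencies and repeat the terms."""
    tokens = []
    for term, weight in term_weights.items():
        tf = round(max(weight, 0.0) * scale)  # non-positive weights are dropped
        tokens.extend([term] * tf)
    return " ".join(tokens)

# Hypothetical DeepCT output for one passage (term -> predicted importance).
predicted = {"coronavirus": 0.72, "vaccine": 0.55, "the": 0.01, "trial": 0.30}
print(weights_to_pseudo_doc(predicted, scale=10))
```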

I will merge it into the main branch soon! I enjoyed your debate at Berlin Buzzwords 2021.

Kind Regards, Nandan

thakur-nandan avatar Jul 02 '21 16:07 thakur-nandan

I'm running most of the docT5query examples now, but I won't have access to a couple datasets since they require accepting dataset use agreements. I'll post results for what I can run here.

joshdevins avatar Jul 08 '21 12:07 joshdevins

Sounds good, Thanks @joshdevins! I look forward to the results 😊

Kind Regards, Nandan

thakur-nandan avatar Jul 09 '21 06:07 thakur-nandan

Results for doc2query-T5 are as follows. As mentioned above, some datasets are excluded because we don't have access to them due to usage restrictions. The baseline is defined here as the Anserini BM25 score.

| dataset | baseline | score | +/- |
| --- | --- | --- | --- |
| msmarco | 0.228 | 0.5064 | 🔼 |
| fever | 0.753 | 0.6926 | 🔽 |
| climate-fever | 0.213 | 0.1772 | 🔽 |
| hotpotqa | 0.603 | 0.5441 | 🔽 |
| dbpedia-entity | 0.313 | 0.3012 | 🔽 |
| nq | 0.329 | 0.3412 | 🔼 |
| webis-touche2020 | 0.614 | 0.5246 | 🔽 |
| trec-covid | 0.656 | 0.6609 | 🔼 |
| quora | 0.742 | 0.7821 | 🔼 |
| cqadupstack | 0.316 | 0.2937 | 🔽 |
| fiqa | 0.236 | 0.2433 | 🔼 |
| scidocs | 0.158 | 0.1558 | 🔽 |
| scifact | 0.665 | 0.6599 | 🔽 |

I don't understand the msmarco result here though. Something seems to be off on that one. I have the pyserini.jsonl in case you want to have a look — I don't see anything obviously wrong with the index.

Some thoughts:

  • Every run will have a different score, since we are generating queries for each run and they will not be identical between runs (if I understand correctly how the queries are being generated with T5); see the sketch after this list.
  • Is it possible that there is a problem with a data split somewhere?
  • The model used, castorini/doc2query-t5-base-msmarco, should be the same as the one used in the publication, so I don't think that accounts for the score discrepancy.
  • I see no title fields in the original corpus.json, although this shouldn't really affect the score like that.
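On the first point, here is a small sketch of two ways the generation could be made repeatable between runs; the settings are illustrative and not necessarily what the published runs used.

```python
# Sketch: making doc2query generation repeatable (illustrative settings,
# assuming the castorini/doc2query-t5-base-msmarco checkpoint via transformers).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).eval()

input_ids = tokenizer.encode("Example passage text.", return_tensors="pt")

# Option 1: keep top-k sampling but fix the seed, so repeated runs match.
torch.manual_seed(42)
sampled = model.generate(input_ids, max_length=64, do_sample=True,
                         top_k=10, num_return_sequences=3)

# Option 2: beam search is deterministic for a given model and input.
beamed = model.generate(input_ids, max_length=64, do_sample=False,
                        num_beams=3, num_return_sequences=3)

for out in (sampled, beamed):
    print([tokenizer.decode(o, skip_special_tokens=True) for o in out])
```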

joshdevins avatar Jul 09 '21 16:07 joshdevins