Sparse query vector

Open maximedb opened this issue 2 years ago • 5 comments

This PR builds upon #62.

It refactors the sparse search to represent queries and documents as CSR matrices. The SPARTA model is updated to fit this setup.
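As an illustration of the CSR layout described above, here is a minimal pure-Python sketch. It is not the code from this PR: the `CSRMatrix` class, the `score` function, and the toy weights are hypothetical names and data chosen for the example, and in practice one would more likely rely on `scipy.sparse.csr_matrix`. Each row (a query or a document) stores only its non-zero vocabulary weights, and relevance is the dot product between a query row and a document row.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CSRMatrix:
    """Hypothetical minimal CSR container: row i owns data[indptr[i]:indptr[i + 1]]."""
    data: List[float]   # non-zero term weights, row after row
    indices: List[int]  # vocabulary id of each stored weight
    indptr: List[int]   # row boundaries into data/indices
    n_cols: int         # vocabulary size

    @classmethod
    def from_rows(cls, rows: List[Dict[int, float]], n_cols: int) -> "CSRMatrix":
        data: List[float] = []
        indices: List[int] = []
        indptr: List[int] = [0]
        for row in rows:
            for col in sorted(row):
                if row[col] != 0.0:  # store non-zero entries only
                    indices.append(col)
                    data.append(row[col])
            indptr.append(len(data))
        return cls(data, indices, indptr, n_cols)

    @property
    def n_rows(self) -> int:
        return len(self.indptr) - 1

    def row(self, i: int) -> Dict[int, float]:
        start, end = self.indptr[i], self.indptr[i + 1]
        return dict(zip(self.indices[start:end], self.data[start:end]))


def score(queries: CSRMatrix, docs: CSRMatrix) -> List[List[float]]:
    """Return an (n_queries x n_docs) table of sparse dot products."""
    doc_rows = [docs.row(j) for j in range(docs.n_rows)]
    table = []
    for i in range(queries.n_rows):
        q = queries.row(i)
        table.append([sum(w * d.get(t, 0.0) for t, w in q.items()) for d in doc_rows])
    return table


# Toy data over a 6-term vocabulary (made up for this example).
docs = CSRMatrix.from_rows([{0: 1.0, 2: 0.5}, {1: 2.0, 2: 1.0}, {4: 3.0}], n_cols=6)
queries = CSRMatrix.from_rows([{2: 2.0, 4: 1.0}, {0: 1.0, 1: 1.0}], n_cols=6)
scores = score(queries, docs)
```

With both sides in the same layout, ranking the documents for a query amounts to sorting one row of `scores`.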

It also adds a clean SPLADE model along with evaluation code. The SPLADE authors used a DenseRetrievalExactSearch in their demo script, but since SPLADE is labeled as a sparse model, in my opinion it should use a SparseSearch. The results are not directly comparable, as this version uses the co-condenser instead of distilbert as the base model. I could not find a download link for the original model.

Maxime.

maximedb avatar Feb 09 '22 21:02 maximedb

Hi Maxime,

Thanks for integrating SPLADE using CSR matrices! I will run it on my side and let you know whether it matches the numbers we have, dataset by dataset (for the CoCondenser version).

For the "original" SPLADE model, it is available here: https://github.com/naver/splade/tree/main/weights/distilsplade_max, but as individual files. We are working on making it available as a tar.gz as well.

cadurosar avatar Feb 10 '22 08:02 cadurosar

Hi Carlos,

> Thanks for integrating SPLADE using CSR matrices! I will run it on my side and let you know whether it matches the numbers we have, dataset by dataset (for the CoCondenser version).

Nice, thanks!

> For the "original" SPLADE model, it is available here: https://github.com/naver/splade/tree/main/weights/distilsplade_max, but as individual files. We are working on making it available as a tar.gz as well.

Uploading the model on the HuggingFace hub would also be possible (and easier to download).

maximedb avatar Feb 10 '22 08:02 maximedb

Hi @maximedb,

Thanks again for making use of CSR matrices for SPLADE. I will have a look at the PR and merge it into beir soon.

A mention of a side project of mine: Sparse Retrieval (https://github.com/NThakur20/sparse-retrieval). We are currently developing a ready-to-use toolkit for efficient training and inference of neural sparse retrieval models such as SPLADE, SPARTA, uniCOIL, TILDE, and DeepImpact. The implementation of SPLADE with CSR matrices works. However, we find it better and more efficient to use an inverted index, such as the one provided by Pyserini. The project is planned to come out by the end of February! We will keep you updated.

We will reproduce the various sparse baselines and additionally upload the models on HF.
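To make the inverted-index alternative mentioned above concrete, here is a small sketch in plain Python. It is a stand-in for the idea only and does not use Pyserini or its API; `build_inverted_index`, `search`, and the toy term weights are hypothetical names and data for illustration. Instead of scoring every document, the query visits only the posting lists of its own non-zero terms.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Postings = Dict[str, List[Tuple[str, float]]]


def build_inverted_index(docs: Dict[str, Dict[str, float]]) -> Postings:
    """Map each term to the (doc_id, weight) pairs of the documents containing it."""
    index: Postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            if weight > 0:  # sparse models leave most terms at zero
                index[term].append((doc_id, weight))
    return dict(index)


def search(index: Postings, query: Dict[str, float], top_k: int = 10) -> List[Tuple[str, float]]:
    """Accumulate dot-product scores, touching only the query's posting lists."""
    scores: Dict[str, float] = defaultdict(float)
    for term, query_weight in query.items():
        for doc_id, doc_weight in index.get(term, []):
            scores[doc_id] += query_weight * doc_weight
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy term weights (made up for this example, not real model output).
toy_docs = {
    "d1": {"sparse": 1.5, "retrieval": 0.5},
    "d2": {"dense": 2.0, "retrieval": 1.0},
    "d3": {"index": 3.0},
}
toy_index = build_inverted_index(toy_docs)
hits = search(toy_index, {"retrieval": 2.0, "index": 1.0})
```

The scores are the same dot products a CSR matrix product would give; the difference is that the work scales with the length of the visited posting lists rather than with the number of documents.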

Kind Regards, Nandan Thakur

thakur-nandan avatar Feb 10 '22 13:02 thakur-nandan

Really cool stuff! The multi-gpu encoding is a super cool feature :-)

maximedb avatar Feb 10 '22 14:02 maximedb

> Uploading the model on the HuggingFace hub would also be possible (and easier to download).

We would love to, but we are still figuring out internally how to do it. Here's a link to the "original model", packaged in the same way as the new ones: https://download-de.europe.naverlabs.com/Splade_Release_Jan22/distilsplade_max.tar.gz

> A mention of a side project of mine: Sparse Retrieval (https://github.com/NThakur20/sparse-retrieval). We are currently developing a ready-to-use toolkit for efficient training and inference of neural sparse retrieval models such as SPLADE, SPARTA, uniCOIL, TILDE, and DeepImpact. The implementation of SPLADE with CSR matrices works. However, we find it better and more efficient to use an inverted index, such as the one provided by Pyserini. The project is planned to come out by the end of February! We will keep you updated.

It looks really cool Nandan, I've starred and will keep an eye on it :)

cadurosar avatar Feb 10 '22 15:02 cadurosar