dhr
dhr copied to clipboard
Dense hybrid representations for text retrieval
Dense Hybrid Retrieval
In this repo, we introduce two approaches to training transformers to capture semantic and lexical text representations for robust dense passage retrieval.
- Aggretriever: A Simple Approach to Aggregate Textual Representation for Robust Dense Passage Retrieval Sheng-Chieh Lin, Minghan Li and Jimmy Lin. (TACL just accepted)
- A Dense Representation Framework for Lexical and Semantic Matching Sheng-Chieh Lin and Jimmy Lin. (TOIS 2021 in press)
This repo contains three parts: (1) densify (2) training (tevatron) (3) retrieval. Our training code is mainly from Tevatron with a minor revision.
Requirements
pip install torch>=1.7.0
pip install transformers==4.15.0
pip install pyserini
pip install beir
Huggingface Checkpoints
| Model | Initialization | MARCO Dev | BEIR (13 public datasets) | Huggingface Path | Document |
|---|---|---|---|---|---|
| DeLADE+[CLS] plus | distilbert-base-uncased | 37.1 | 49.8 | jacklin/DeLADE-CLS-P | Read Me |
| DeLADE+[CLS] | distilbert-base-uncased | 35.7 | 48.5 | jacklin/DeLADE-CLS | Read Me |
| Aggretriever | distilbert-base-uncased | 34.1 | 46.0 | jacklin/DistilBERT-AGG | Read Me |
Aggretriever
In this paper, we introduce a simple approach to aggregating token-level information into a single-vector dense representation. We provide instruction for model training and evaluation on MS MARCO passage ranking dataset in the document. We also provide instruction for the evaluation on BEIR datasets in the document.
A Dense Representation Framework for Lexical and Semantic Matching
In this paper, we introduce a unified representation framework for Lexical and Semantic Matching. We first introduce how to use our framework to conduct retrieval for high-dimensional (lexcial) representations and combine with single-vector dense (semantic) representations for hybrid search.
Dense Lexical Retrieval
We can densify any existing lexical matching models and conduct lexical matching on GPU. In the document, we demonstrate how to conduct BM25 and uniCOIL end-to-end retrieval under our framework. Detailed description can be found in our paper.
Dense Hybrid Retrieval
With the densified lexical representations, we can easily conduct lexical and semantic hybrid retrieval using independent neural models. A document for hybrid retrieval will be coming soon.
Dense Hybrid Representation Model
In our paper, we propose a single model fusion approach by training the lexical and semantic components of a transformer while inference, we combine the densified lexical representations and dense representations as dense hybrid representations. Instead of training by yourself, you can also download our trained DeLADE-CLS-P, DeLADE-CLS and DeLADE and directly peform inference on MSMARCO Passage dataset (see document) or BEIR datasets (see document).