learned-sparse-retrieval
Unified Learned Sparse Retrieval Framework
LSR: A unified framework for efficient and effective learned sparse retrieval
The framework provides a simple yet effective toolkit for defining, training, and evaluating learned sparse retrieval methods. The framework is composed of standalone modules, allowing for easy mixing and matching of different modules or integration with your own implementation. This provides flexibility to experiment and customize the retrieval model to meet your specific needs.
The structure of the lsr package is as follows:
├── configs #configuration of different components
│ ├── dataset
│ ├── experiment #define exp details: dataset, loss, model, hp
│ ├── loss
│ ├── model
│ └── wandb
├── datasets #implementations of dataset loading & collator
├── losses #implementations of different losses + regularizer
├── models #implementations of different models
├── tokenizer #a wrapper of HF's tokenizers
├── trainer #trainer for training
└── utils #common utilities used in different places
- The list of all configurations used in the paper can be found here
- The instructions for running experiments can be found here
Training and inference instructions
1. Create conda environment and install dependencies:
Create the conda environment:
conda create --name lsr python=3.9.12
conda activate lsr
Install dependencies with pip
pip install -r requirements.txt
2. Download/Prepare datasets
We have included all pre-defined dataset configurations under lsr/configs/dataset. Before starting training, ensure that you have the ir_datasets and (huggingface) datasets libraries installed, as the framework will automatically download and store the necessary data to the correct directories.
For datasets from ir_datasets, the downloaded files are saved by default at ~/.ir_datasets/. You can modify this path by changing the IR_DATASETS_HOME environment variable.
Similarly, for datasets from the HuggingFace's datasets, the downloaded files are stored at ~/.cache/huggingface/datasets by default. To specify a different cache directory, set the HF_DATASETS_CACHE environment variable.
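As a minimal sketch, both cache locations can also be set from Python rather than in the shell. The two /data/... paths below are placeholders chosen for illustration, not paths used by the framework. Setting the variables before the libraries are imported is the safe order.

```python
import os

# Placeholder locations (assumptions for illustration only);
# point these at a disk with enough free space.
os.environ["IR_DATASETS_HOME"] = "/data/ir_datasets"
os.environ["HF_DATASETS_CACHE"] = "/data/hf_datasets_cache"

# Import ir_datasets / datasets only after this point so that they
# pick up the locations configured above.
print(os.environ["IR_DATASETS_HOME"])
print(os.environ["HF_DATASETS_CACHE"])
```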
To train a custom model on your own dataset, please use the sample configurations under lsr/configs/dataset as templates. Overall, you need three important files (see lsr/dataset_utils for the file format):
- document collection: maps document_id to document_text
- queries: maps query_id to query_text
- train triplets or scored pairs:
  - train triplets, used for contrastive learning, contain a list of <query_id, positive_document_id, negative_document_id> triplets.
  - scored_pairs, used for distillation training, contain pairs of <query, document_id> with a relevance score.
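The three files can be pictured with a toy example. The sketch below writes a tiny collection, query file, and triplet file as tab-separated text. The TSV layout, the file names, and the ids are assumptions made for illustration; the authoritative format is the one described in lsr/dataset_utils.

```python
import tempfile
from pathlib import Path

# Toy data; all ids and texts are made up for illustration.
documents = {
    "d1": "sparse retrieval with learned term weights",
    "d2": "a recipe for pancakes",
}
queries = {"q1": "what is learned sparse retrieval"}
# Each triplet is (query_id, positive_document_id, negative_document_id).
triplets = [("q1", "d1", "d2")]


def write_tsv(path, rows):
    """Write each row as one tab-separated line (assumed layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write("\t".join(row) + "\n")


out_dir = Path(tempfile.mkdtemp())
write_tsv(out_dir / "collection.tsv", documents.items())
write_tsv(out_dir / "queries.tsv", queries.items())
write_tsv(out_dir / "train_triplets.tsv", triplets)

print((out_dir / "train_triplets.tsv").read_text(encoding="utf-8"))
```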
3. Train a model
To train an LSR model, run the following command:
python -m lsr.train +experiment=sparta_msmarco_distil \
training_arguments.fp16=True
Please note that:
- In this command, sparta_msmarco_distil refers to the experiment configuration file located at lsr/configs/experiment/sparta_msmarco_distil.yaml. If you wish to use a different experiment, simply change this value to the name of the desired configuration file under lsr/configs/experiment.
- You may notice a + before experiment=sparta_msmarco_distil. This is a convention in Hydra to add a new configuration key (in this case, experiment) that is not yet defined in lsr/configs/config.yaml. If you want to override an existing key (e.g., training_arguments.fp16), you don't need to use the + symbol.
- We trained some models using NVIDIA A100 80GB, allowing us to use large batch sizes (e.g., 128). To replicate our experiments on smaller GPUs, reduce the batch size and increase the gradient accumulation steps (e.g., add training_arguments.per_device_train_batch_size=64 +training_arguments.gradient_accumulation_steps=2 to your training command). Note: With models (e.g., Splade) using sparse regularizers during training, the results may still differ slightly since we don't take accumulation steps into account for adjusting regularization weights.
- We use wandb (by default) to monitor the training process, including loss, regularization, query length, and document length. If you wish to disable this feature, you can do so by adding training_arguments.report_to='none' to the above command. Alternatively, you can follow the instructions here to set up wandb.
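The batch-size advice above comes down to simple arithmetic: halving the per-device batch size while doubling the accumulation steps keeps the number of examples per optimizer update unchanged. A minimal sketch follows; the function name and the num_devices parameter are illustrative and not part of the framework.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of training examples contributing to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Large-GPU setting from the text: batch size 128, no accumulation.
original = effective_batch_size(128, 1)
# Smaller-GPU setting from the text: batch size 64, 2 accumulation steps.
reduced = effective_batch_size(64, 2)

print(original, reduced)  # both are 128
```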
4. Run inference on MSMARCO dataset
When training has finished, you can use our inference scripts to generate new queries and documents as follows:
4.1 Generate queries
input_path=data/msmarco/dev_queries/raw.tsv
output_file_name=raw.tsv
batch_size=256
type='query'
python -m lsr.inference \
inference_arguments.input_path=$input_path \
inference_arguments.output_file=$output_file_name \
inference_arguments.type=$type \
inference_arguments.batch_size=$batch_size \
inference_arguments.scale_factor=100 \
+experiment=sparta_msmarco_distil
4.2 Generate documents
input_path=data/msmarco/full_collection/split/part01
output_file_name=part01
batch_size=256
type='doc'
python -m lsr.inference \
inference_arguments.input_path=$input_path \
inference_arguments.output_file=$output_file_name \
inference_arguments.type=$type \
inference_arguments.batch_size=$batch_size \
inference_arguments.scale_factor=100 \
inference_arguments.top_k=-400 \
+experiment=sparta_msmarco_distil
Note:
- The top_k argument is the number of terms you want to keep; a negative top_k means no pruning (all positive terms are kept).
- scale_factor is used for weight quantization; float weights are multiplied by this scale_factor and rounded to the nearest integer.
- Inference on the document collection takes a long time. Therefore, it is better to split the collection into multiple partitions and run inference using multiple GPUs.
- All the generated queries and documents are stored in the output/{exp_name}/inference/ directory by default, where the exp_name parameter is defined in the experiment configuration file. You can change it as you like.
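To make the roles of top_k and scale_factor concrete, here is a minimal sketch of pruning followed by quantization on a single term-weight dictionary. It illustrates the behaviour described in the notes above and is not the framework's own implementation; the function name and the toy weights are made up.

```python
def prune_and_quantize(term_weights, top_k=-1, scale_factor=100):
    """Keep the top_k highest-weighted positive terms, then quantize the weights.

    A negative top_k disables pruning, so every positive term is kept.
    """
    # Only terms with a positive weight are candidates.
    positive = [(term, w) for term, w in term_weights.items() if w > 0]
    # Highest weights first, so slicing keeps the strongest terms.
    positive.sort(key=lambda pair: pair[1], reverse=True)
    if top_k >= 0:
        positive = positive[:top_k]
    # Multiply by scale_factor and round to the nearest integer.
    return {term: int(round(w * scale_factor)) for term, w in positive}


weights = {"hello": 0.031, "world": 1.257, "foo": 0.0, "bar": 0.402}

print(prune_and_quantize(weights, top_k=2))   # {'world': 126, 'bar': 40}
print(prune_and_quantize(weights, top_k=-1))  # {'world': 126, 'bar': 40, 'hello': 3}
```

With scale_factor=100, two decimal places of each weight survive quantization; a smaller factor gives a coarser but more compact index.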
5. Index generated documents
5.1 Download and install our modified Anserini indexing software:
We made simple changes in the indexing procedure in Anserini to improve the indexing speed (by 10x).
In the old method, Anserini first creates fake documents from JSON weight files (e.g., {"hello": 3}) by repeating each term as many times as its weight (e.g., "hello hello hello") and then indexes these fake documents as regular documents. Creating these fake documents can substantially slow down indexing for LSR, where the number of terms and their weights are usually large. To get rid of this issue, we leverage the FeatureField in Lucene to inject the (term, weight) pairs directly into the index. The change is simple but quite effective, especially when you have to index multiple times (as in the paper).
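The cost of the fake-document approach is easy to see in a small sketch. The code below is an illustration in Python only, not Anserini's Java code: it expands a weight dictionary into a fake document and compares its length with the number of (term, weight) pairs that would otherwise be stored directly.

```python
def to_fake_document(term_weights):
    """Repeat every term as many times as its integer weight."""
    return " ".join(
        " ".join([term] * weight) for term, weight in term_weights.items()
    )


vector = {"hello": 3, "world": 2}

fake = to_fake_document(vector)
tokens_in_fake_doc = len(fake.split())  # grows with the sum of the weights
pairs_stored_directly = len(vector)     # grows only with the number of terms

print(fake)  # hello hello hello world world
print(tokens_in_fake_doc, pairs_stored_directly)  # 5 2
```

With weights quantized by a scale factor of 100, a single term can be repeated hundreds of times, which is why the fake documents become expensive to build and tokenize.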
You can download the modified Anserini version here, then follow the instructions in the README for installation. If the tests fail, you can skip them by adding -Dmaven.test.skip=true.
When the installation is done, you can continue with the next steps.
5.2 Index with Anserini
./anserini-lsr/target/appassembler/bin/IndexCollection \
-collection JsonSparseVectorCollection \
-input outputs/sparta_distil_sentence_transformers/inference/doc/ \
-index outputs/sparta_distil_sentence_transformers/index \
-generator SparseVectorDocumentGenerator \
-threads 60 -impact -pretokenized
Note that you have to change sparta_distil_sentence_transformers to the output defined in your experiment configuration file (here: lsr/configs/experiment/sparta_msmarco_distil.yaml).
6. Search on the Inverted Index
./anserini-lsr/target/appassembler/bin/SearchCollection \
-index outputs/sparta_distil_sentence_transformers/index/ \
-topics outputs/sparta_distil_sentence_transformers/inference/query/raw.tsv \
-topicreader TsvString \
-output outputs/sparta_distil_sentence_transformers/run.trec \
-impact -pretokenized -hits 1000 -parallelism 60
Here, you may need to change the output directory as in 5.2.
7. Evaluate the run file
ir_measures qrels.msmarco-passage.dev-subset.txt outputs/sparta_distil_sentence_transformers/run.trec MRR@10 R@1000 NDCG@10
qrels.msmarco-passage.dev-subset.txt is the qrels file for MSMARCO-dev in TREC format. You can find it on the MSMARCO or TREC DL (19,20) website. Note that for TREC DL (19,20), you have to change R@1000 to "R(rel=2)@1000" (with the quotes).
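For intuition about what the evaluation measures, the sketch below computes MRR@10 by hand from a few lines in the standard TREC qrels format (query_id, iteration, doc_id, relevance) and TREC run format (query_id, Q0, doc_id, rank, score, run_tag). It is a simplified illustration with made-up ids, not a replacement for ir_measures, which may differ in details such as which queries are averaged over.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each query id to the set of its relevant document ids."""
    relevant = defaultdict(set)
    for line in lines:
        qid, _iteration, doc_id, rel = line.split()
        if int(rel) > 0:
            relevant[qid].add(doc_id)
    return relevant


def parse_run(lines):
    """Map each query id to its document ids ordered by rank."""
    ranked = defaultdict(list)
    for line in lines:
        qid, _q0, doc_id, rank, _score, _tag = line.split()
        ranked[qid].append((int(rank), doc_id))
    return {qid: [doc for _, doc in sorted(docs)] for qid, docs in ranked.items()}


def mrr_at_k(relevant, run, k=10):
    """Mean over run queries of 1/rank of the first relevant hit in the top k."""
    total = 0.0
    for qid, docs in run.items():
        for position, doc_id in enumerate(docs[:k], start=1):
            if doc_id in relevant.get(qid, ()):
                total += 1.0 / position
                break
    return total / len(run) if run else 0.0


qrels_lines = ["q1 0 d1 1", "q2 0 d7 1"]
run_lines = [
    "q1 Q0 d3 1 12.0 lsr",
    "q1 Q0 d1 2 11.5 lsr",
    "q2 Q0 d9 1 9.0 lsr",
]

score = mrr_at_k(parse_qrels(qrels_lines), parse_run(run_lines))
print(score)  # 0.25: q1 contributes 1/2, q2 contributes 0
```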
List of configurations used in the paper
- RQ1: Are the results from recent LSR papers reproducible?
Results in Table 3 are the outputs of the following experiments:
| Method | Configuration |
|---|---|
| DeepCT | lsr/configs/experiment/deepct_msmarco_term_level.yaml |
| uniCOIL | lsr/configs/experiment/unicoil_msmarco_multiple_negative.yaml |
| uniCOILdT5q | lsr/configs/experiment/unicoil_doct5query_msmarco_multiple_negative.yaml |
| uniCOILtilde | lsr/configs/experiment/unicoil_tilde_msmarco_multiple_negative.yaml |
| EPIC | lsr/configs/experiment/epic_original.yaml |
| DeepImpact | lsr/configs/experiment/deep_impact_original.yaml |
| TILDEv2 | lsr/configs/experiment/tildev2_msmarco_multiple_negative.yaml |
| Sparta | lsr/configs/experiment/sparta_original.yaml |
| Splademax | lsr/configs/experiment/splade_msmarco_multiple_negative.yaml |
| distilSplademax | lsr/configs/experiment/splade_msmarco_distil_flops_0.1_0.08.yaml |
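Since an experiment name is the configuration file name without its .yaml extension, each path in these tables maps directly to a training command. A small hypothetical helper (not part of the lsr package) makes the mapping explicit:

```python
from pathlib import PurePosixPath


def train_command(config_path, extra_overrides=()):
    """Build the training command for one experiment configuration file."""
    # The experiment name is the file name without the .yaml extension.
    name = PurePosixPath(config_path).stem
    return " ".join(["python -m lsr.train", f"+experiment={name}", *extra_overrides])


command = train_command(
    "lsr/configs/experiment/splade_msmarco_multiple_negative.yaml",
    ["training_arguments.fp16=True"],
)
print(command)
# python -m lsr.train +experiment=splade_msmarco_multiple_negative training_arguments.fp16=True
```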
- RQ2: How do LSR methods perform with recent advanced training techniques?
Results in Table 4 are the outputs of the following experiments:
| Method | Configuration |
|---|---|
| uniCOIL | lsr/configs/experiment/unicoil_msmarco_distil.yaml |
| uniCOILdT5q | lsr/configs/experiment/unicoil_doct5query_msmarco_distil.yaml |
| uniCOILtilde | lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml |
| EPIC | lsr/configs/experiment/epic_msmarco_distil.yaml |
| DeepImpact | lsr/configs/experiment/deep_impact_msmarco_distil.yaml |
| TILDEv2 | lsr/configs/experiment/tildev2_msmarco_distil.yaml |
| Sparta | lsr/configs/experiment/sparta_msmarco_distil.yaml |
| distilSplademax | lsr/configs/experiment/splade_msmarco_distil_flops_0.1_0.08.yaml |
| distilSpladesep | lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml |
- RQ3: How does the choice of encoder architecture and regularization affect results?
Results in Table 5 are the outputs of the following experiments:
- MSMARCO Passage
| Effect | Row | Configuration |
|---|---|---|
| Doc weighting | 1a | Before: lsr/configs/experiment/splade_asm_dbin_msmarco_distil.yaml After: lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml |
| | 1b | Before: lsr/configs/experiment/unicoil_dbin_tilde_msmarco_distil.yaml After: lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml |
| Query weighting | 2a | Before: lsr/configs/experiment/tildev2_msmarco_distil.yaml After: lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml |
| | 2b | Before: lsr/configs/experiment/epic_qbin_msmarco_distil.yaml After: lsr/configs/experiment/epic_msmarco_distil.yaml |
| Doc expansion | 3a | Before: lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml After: lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml |
| | 3b | Before: lsr/configs/experiment/unicoil_msmarco_distil.yaml After: lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml |
| Query expansion | 4a | Before: lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml After: lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml |
| | 4b | Before: lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml After: lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml |
| Regularization | 5a | Before: lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml After: lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.00.yaml |
- Tripclick
| Effect | Row | Configuration |
|---|---|---|
| Doc weighting | 1a | Before: lsr/configs/experiment/qmlp_dbin_tripclick_multiple_negative.yaml After: lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml |
| | 1b | Before: lsr/configs/experiment/qmlp_dexpbin_tripclick_multiple_negative.yaml After: lsr/configs/experiment/unicoil_tilde_tripclick_multiple_negative.yaml |
| Query weighting | 2a | Before: lsr/configs/experiment/sparta_tripclick_multiple_negative.yaml After: lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_0.0_0.0.yaml |
| | 2b | Before: lsr/configs/experiment/qbin_dmlp_tripclick_multiple_negative.yaml After: lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml |
| Doc expansion | 3a | Before: lsr/configs/experiment/qmlm_dmlp_tripclick_hard_negative_l1_0.001.yaml After: lsr/configs/experiment/splade_asm_tripclick_multiple_negative_l1_0.001_0.00001.yaml |
| | 3b | Before: lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml After: lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml |
| Query expansion | 4a | Before: lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml After: lsr/configs/experiment/splade_asm_tripclick_multiple_negative_l1_0.001_0.00001.yaml |
| | 4b | Before: lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml After: lsr/configs/experiment/qmlm_dmlp_tripclick_hard_negative_l1_0.001.yaml |
| Regularization | 5a | Before: lsr/configs/experiment/epic_tripclick_multiple_negative.yaml After: lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml |
Citing and Authors
If you find this repository helpful, feel free to cite our paper, A Unified Framework for Learned Sparse Retrieval:
@inproceedings{nguyen2023unified,
title={A Unified Framework for Learned Sparse Retrieval},
author={Nguyen, Thong and MacAvaney, Sean and Yates, Andrew},
booktitle={Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2--6, 2023, Proceedings, Part III},
pages={101--116},
year={2023},
organization={Springer}
}