ColBERT
Sample training code and dataset for ColBERTv2
Hope you release the training code of ColBERTv2.
Thanks for your interest! Yes, we'll release instructions for ColBERTv2 soon. (FWIW, most of the code is already there: you just need to provide distillation scores in the triples jsonl files.)
I'll update here once that's done.
By the way, what did you mean by "dataset" here? All raw data we've used is public (e.g., LoTTE, MS MARCO, BEIR)
At least on our side (we're also interested in this :)), it would be the exact files, or the code to download and generate them: so, for ColBERTv2, the MS MARCO triples for the first training stage and the scored triples for the second. That way one can train ColBERTv2 and get the same results as you did, which would make it easier to try some improvements on the model.
Great work! We would also love to have access to the w-way tuples you used for training, as that would make it easier for us to replicate your work. Also, could you elaborate a little more on how the "higher ranked" and "lower ranked" passages are selected? For example, what is the specific range of positions you consider when selecting the high- and low-ranked passages?
Could you create an example train.sh? I don't know how to pass the params or how to run it.
Hi, is there any update on this?
First of all, great work!
Any plans to add scripts for ColBERTv2 training? If not, some pointers on which parts of the code need to be changed would help. For example, if I put distillation scores in the triples JSON file, do I need to change or add something around L130-L140 of training.py? Thanks!
May I ask where to obtain the JSON-format data?
Is there any update on the training code of ColBERTv2?
I'm pretty sure all the code you need has been in the repo for at least 15-16 months. We don't have detailed instructions yet, though. I can paste some examples below in a day or two.
Thank you sir, that would be wonderful
Hey @okhat, following up on what I told you today during the workshop: it would be really appreciated to have some instructions and maybe the data file (if it still exists) you used to train your version of ColBERTv2. I understand that the code is available, but unfortunately it is not so easy (for me and a few other researchers) to make it work.
The code for training is pretty simple, and the args used for training the model are stored in the checkpoint too. I believe many people have used it; here's a snippet:
```python
from colbert.infra.run import Run
from colbert.infra.config import ColBERTConfig, RunConfig
from colbert import Trainer


def train():
    with Run().context(RunConfig(nranks=4)):
        triples = 'triples.json'
        queries = 'queries.train.tsv'
        collection = 'collection.tsv'

        config = ColBERTConfig(bsize=32, lr=1e-05, warmup=20_000, doc_maxlen=180,
                               dim=128, nway=64, accumsteps=1, use_ib_negatives=True)
        trainer = Trainer(triples=triples, queries=queries, collection=collection, config=config)
        trainer.train(checkpoint='initial_ckpt_path')


if __name__ == '__main__':
    train()
```
I'm happy to share the tuples we used for training, but the file is gigantic (~26GB). I also have the initial checkpoint and can share it with you, though as hypothesized in the full paper it's almost certainly not essential (raw bert-base-uncased should work just fine).
Each line has the following structure:
```json
[646158,[835711,9.7109375],[1420483,-3.34765625],[3281369,-4.328125],[4669769,-1.17578125],[7583115,1.9189453125],[2115204,-3.47265625],[455241,-2.572265625],[4581924,-0.4794921875],[1037129,-4.453125],[1078885,-2.837890625],[307673,-2.185546875],[5560785,-2.25],[7926175,4.4765625],[1316628,-3.650390625],[413536,-0.223876953125],[7120194,5.42578125],[4669773,-0.6689453125],[6074871,-8.1875],[48633,1.2470703125],[6129077,-2.609375],[7926181,6.01953125],[608834,-3.498046875],[1190049,-4.21484375],[3706610,-0.6728515625],[8129483,2.3984375],[4296354,0.162353515625],[5493715,2.279296875],[698231,-2.12890625],[1936349,-7.65234375],[4885140,2.23046875],[455239,-1.6484375],[7455097,-0.2093505859375],[455238,1.8603515625],[6742992,-0.475341796875],[3830145,-4.828125],[4564710,-2.365234375],[7120201,-1.7333984375],[1055876,-0.27490234375],[6023773,-5.23828125],[3609867,-4.28515625],[8493097,0.3388671875],[7139673,-1.1416015625],[2630250,-1.87890625],[8086203,-6.27734375],[6918809,-2.533203125],[4221375,-1.3837890625],[48636,2.20703125],[6863652,-4.3828125],[604975,-2.0],[3982496,-3.294921875],[220159,1.4990234375],[2140129,-4.93359375],[7877610,0.96142578125],[4023657,-2.541015625],[5862923,-4.53125],[8371031,-1.18359375],[1156095,0.1455078125],[7899193,-6.25390625],[2957376,3.09375],[4591118,-3.09765625],[5560789,1.98828125],[8712513,-5.953125],[5416214,1.4638671875],[1954683,0.78759765625]]
```
That is: a `qid`, followed by a positive passage (labeled in qrels, or just the top-1 passage even if not in qrels) with its cross-encoder score, then `w-1` negatives (here, 64-1 = 63 as in the paper) from another retrieval round, each with its cross-encoder score.
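To make the layout concrete, here's a minimal sketch of parsing one such line with Python's `json` module. The line below is a truncated version of the example above, keeping only the first two negatives:

```python
import json

# A truncated example line: qid, then (pid, cross-encoder score) pairs.
# The first pair is the positive; the rest are negatives.
line = '[646158,[835711,9.7109375],[1420483,-3.34765625],[3281369,-4.328125]]'

example = json.loads(line)
qid, pairs = example[0], example[1:]
positive, negatives = pairs[0], pairs[1:]

print(qid)             # 646158
print(positive)        # [835711, 9.7109375]
print(len(negatives))  # 2
```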
As far as I can tell, the code for this has always been there. Hopefully having a snippet helps!
Feel free to post followups if you face issues.
Can you tell us how to generate the dataset from triples.train.small.tar.gz?
You start with an existing retriever (e.g., BM25 or ColBERTv1), index the collection, search with the training questions, and take the top-K. Score the <q, d> pairs with the cross-encoder teacher model. Sample them into the format above (you don't need 63 negatives, btw; you can get excellent performance with just 3 or 7). Then you train!
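As a rough illustration of the final sampling step (this is not the repo's actual tooling; the function name and input layout are made up for this sketch), assuming you already have, per query, a ranked top-K list with cross-encoder scores attached:

```python
import json
import random


def write_nway_examples(scored_rankings, nway=4, seed=42, out_path='examples.json'):
    """Write one JSON line per query: [qid, [pid, score], [pid, score], ...].

    scored_rankings maps qid -> list of (pid, cross_encoder_score) in
    ranking order. The top-1 entry plays the role of the positive, and
    nway-1 negatives are sampled from the remaining candidates.
    """
    rng = random.Random(seed)
    with open(out_path, 'w') as f:
        for qid, ranking in scored_rankings.items():
            positive, rest = ranking[0], ranking[1:]
            negatives = rng.sample(rest, k=nway - 1)
            line = [qid] + [list(pair) for pair in [positive] + negatives]
            f.write(json.dumps(line) + '\n')


# Toy usage with made-up pids/scores, sampling a 4-way tuple (1 positive + 3 negatives):
write_nway_examples({646158: [(835711, 9.71), (1420483, -3.35), (3281369, -4.33), (7583115, 1.92)]})
```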
Thank you sir
Btw, there may be code for cross-encoder scoring in the library too... like launching this in parallel on multiple GPUs. But this part isn't specific to ColBERT anyway.
Could you provide example arguments for the scripts in the `utility/` directory? For instance, `utility/triples.py` takes `--ranking`, `--output`, `--positives`, `--depth`, `--permissive`, `--biased`, and `--seed`; it would be nice to have an example for each argument.
I'm excited to finally close this issue. Data below.
https://twitter.com/lateinteraction/status/1753428544259346935
Basically, I uploaded the examples file (64-way) and the initial checkpoint (ColBERT v1.9) to the HF Hub.
The instructions are now in the README.
Instructions:
https://github.com/stanford-futuredata/ColBERT?tab=readme-ov-file#advanced-training-colbertv2-style
Examples file:
https://huggingface.co/colbert-ir/colbertv2.0_msmarco_64way/blob/main/examples.json
```python
from colbert.infra.run import Run
from colbert.infra.config import ColBERTConfig, RunConfig
from colbert import Trainer


def train():
    # Use 4 GPUs (e.g., four A100s; you can use fewer by changing nway, accumsteps, bsize).
    with Run().context(RunConfig(nranks=4)):
        triples = '/path/to/examples.64.json'  # `wget https://huggingface.co/colbert-ir/colbertv2.0_msmarco_64way/resolve/main/examples.json?download=true` (26GB)
        queries = '/path/to/MSMARCO/queries.train.tsv'
        collection = '/path/to/MSMARCO/collection.tsv'

        config = ColBERTConfig(bsize=32, lr=1e-05, warmup=20_000, doc_maxlen=180,
                               dim=128, attend_to_mask_tokens=False, nway=64, accumsteps=1,
                               similarity='cosine', use_ib_negatives=True)
        trainer = Trainer(triples=triples, queries=queries, collection=collection, config=config)
        trainer.train(checkpoint='colbert-ir/colbertv1.9')  # or start from scratch, like `bert-base-uncased`


if __name__ == '__main__':
    train()
```
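For anyone with fewer than 4 GPUs: per the comment in the script about changing `nway`, `accumsteps`, and `bsize` (and the earlier note that 3 or 7 negatives already work well), a plausible single-GPU variant of the config might look like the sketch below. These exact values are my own guess, not settings from the authors:

```python
# Hypothetical single-GPU settings: nranks=1, a smaller nway and bsize, with
# gradient accumulation partially compensating for the smaller per-step batch.
with Run().context(RunConfig(nranks=1)):
    config = ColBERTConfig(bsize=8, lr=1e-05, warmup=20_000, doc_maxlen=180,
                           dim=128, attend_to_mask_tokens=False, nway=8, accumsteps=4,
                           similarity='cosine', use_ib_negatives=True)
```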