
Sample training code and dataset of ColBERTv2

Open · bino282 opened this issue 2 years ago · 7 comments

Hope you release the training code of ColBERTv2.

bino282 avatar Feb 21 '22 02:02 bino282

Thanks for your interest! Yes, we'll release instructions for ColBERTv2 soon. (FWIW, most of the code is already there: you just need to provide distillation scores in the triples jsonl files.)

okhat avatar Feb 21 '22 02:02 okhat

I'll update here once that's done.

okhat avatar Feb 21 '22 02:02 okhat

By the way, what did you mean by "dataset" here? All raw data we've used is public (e.g., LoTTE, MS MARCO, BEIR)

okhat avatar Mar 04 '22 17:03 okhat

At least on our side (we are also interested in this :), what would help is the exact files, or the code to download and generate them (so, in the case of ColBERTv2, the MS MARCO triplets for the "first training" and the scored triplets for the "second training"), so that one can train ColBERTv2 and get the same results as you did (which would make it easier to try some improvements on the model).

cadurosar avatar Mar 09 '22 15:03 cadurosar

Great work! We would also love to have access to the w-way tuples you used for training; that would make it easier for us to replicate your work. Also, could you elaborate a little more on how the "higher ranked passage" and "lower ranked passages" are selected? For example, what specific range of positions do you consider when selecting the high- and low-ranked passages?

yingrui-yang avatar Mar 30 '22 03:03 yingrui-yang

Could you create an example train.sh? I don't know how to pass the params or how to run it.

bino282 avatar Apr 05 '22 03:04 bino282

Hi, is there any update on this?

kwojtasi avatar Sep 20 '22 11:09 kwojtasi

First of all, great work!

Any plans to add scripts for ColBERTv2 training? If not, some pointers on which parts of the code need to be changed would help. For example, if I put distillation scores in the triples JSON file, do I need to change or add anything around L130-L140 of training.py? Thanks!

jamesoneill12 avatar Nov 09 '22 16:11 jamesoneill12

May I ask where to obtain the JSON-format data?

liuhuaizheng avatar Jun 14 '23 09:06 liuhuaizheng

Is there any update on the training code of ColBERTv2?

huyhuyvu01 avatar Jun 30 '23 03:06 huyhuyvu01

I'm pretty sure all the code you need has been in the repo for at least 15-16 months. We don't have detailed instructions yet, though. I can paste some examples below in a day or two.

okhat avatar Jun 30 '23 03:06 okhat

Thank you sir, that would be wonderful

huyhuyvu01 avatar Jun 30 '23 03:06 huyhuyvu01

Hey @okhat, following up on what I told you today during the workshop: it would be really appreciated to have some instructions, and maybe the data file (if it still exists) you used to train your version of ColBERTv2. I understand that the code is available, but unfortunately it is not so easy (for me and a few more researchers) to make it work.

cadurosar avatar Jul 27 '23 06:07 cadurosar

The code for training is pretty simple, and the args used for training the model are stored in the checkpoint too. I believe many people have used it; here's a snippet:

from colbert.infra.run import Run
from colbert.infra.config import ColBERTConfig, RunConfig
from colbert import Trainer


def train():
    with Run().context(RunConfig(nranks=4)):
        triples = 'triples.json'
        queries = 'queries.train.tsv'
        collection = 'collection.tsv'

        config = ColBERTConfig(bsize=32, lr=1e-05, warmup=20_000, doc_maxlen=180, dim=128, nway=64, accumsteps=1, use_ib_negatives=True)
        trainer = Trainer(triples=triples, queries=queries, collection=collection, config=config)

        trainer.train(checkpoint='initial_ckpt_path')


if __name__ == '__main__':
    train()

I'm happy to share the tuples we used for training, but the file is gigantic (~26GB). I also have the initial checkpoint and can share it with you, though as hypothesized in the full paper it's almost certainly not essential (raw bert-base-uncased should work just fine).

Each line has the following structure:

[646158,[835711,9.7109375],[1420483,-3.34765625],[3281369,-4.328125],[4669769,-1.17578125],[7583115,1.9189453125],[2115204,-3.47265625],[455241,-2.572265625],[4581924,-0.4794921875],[1037129,-4.453125],[1078885,-2.837890625],[307673,-2.185546875],[5560785,-2.25],[7926175,4.4765625],[1316628,-3.650390625],[413536,-0.223876953125],[7120194,5.42578125],[4669773,-0.6689453125],[6074871,-8.1875],[48633,1.2470703125],[6129077,-2.609375],[7926181,6.01953125],[608834,-3.498046875],[1190049,-4.21484375],[3706610,-0.6728515625],[8129483,2.3984375],[4296354,0.162353515625],[5493715,2.279296875],[698231,-2.12890625],[1936349,-7.65234375],[4885140,2.23046875],[455239,-1.6484375],[7455097,-0.2093505859375],[455238,1.8603515625],[6742992,-0.475341796875],[3830145,-4.828125],[4564710,-2.365234375],[7120201,-1.7333984375],[1055876,-0.27490234375],[6023773,-5.23828125],[3609867,-4.28515625],[8493097,0.3388671875],[7139673,-1.1416015625],[2630250,-1.87890625],[8086203,-6.27734375],[6918809,-2.533203125],[4221375,-1.3837890625],[48636,2.20703125],[6863652,-4.3828125],[604975,-2.0],[3982496,-3.294921875],[220159,1.4990234375],[2140129,-4.93359375],[7877610,0.96142578125],[4023657,-2.541015625],[5862923,-4.53125],[8371031,-1.18359375],[1156095,0.1455078125],[7899193,-6.25390625],[2957376,3.09375],[4591118,-3.09765625],[5560789,1.98828125],[8712513,-5.953125],[5416214,1.4638671875],[1954683,0.78759765625]]

That is, "qid", followed by a "positive passage" (labeled in qrels or just the top-1 passage even if not in qrels) and its cross-encoder score, then "w-1" (in this case, 64-1 as in the paper) negatives from another retrieval round with their cross-encoder scores.
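
For anyone reading the file directly, here's a minimal sketch (plain Python; the file name and variable names are mine) of how one such line decodes into a query id, the positive, and the remaining negatives:

import json

# Each line of the examples file is a JSON array:
#   [qid, [pid, score], [pid, score], ...]
# The first [pid, score] pair is the positive; the rest are negatives.
with open('examples.json') as f:
    for line in f:
        qid, positive, *negatives = json.loads(line)
        pos_pid, pos_score = positive
        print(qid, pos_pid, pos_score, len(negatives))
        break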

okhat avatar Jul 27 '23 08:07 okhat

As far as I can tell, the code for this has always been there. Hopefully having a snippet helps!

Feel free to post followups if you face issues.

okhat avatar Jul 27 '23 08:07 okhat

Can you tell me how to generate the dataset from triples.train.small.tar.gz?

liuhuaizheng avatar Jul 27 '23 08:07 liuhuaizheng

You start with an existing retriever (either BM25 or ColBERTv1 for example), index the collection, search with the training questions, find the top-K. Score <q, d> pairs with the cross-encoder teacher model. Sample them into the format above (you don't need 63 negatives btw, you can get excellent performance with just 3 or 7). Then you train!
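
A rough sketch of that pipeline, assuming a sentence-transformers cross-encoder as the teacher (the model name, the search_top_k helper, and the in-memory collection dict are placeholders, not the exact setup used for the released file):

import json
import random

from sentence_transformers import CrossEncoder

# Placeholder teacher; any reasonably strong cross-encoder should work here.
teacher = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def make_example(qid, query, ranked_pids, collection, nway=4):
    # ranked_pids: top-K passage ids for this query from an existing retriever
    # (BM25, ColBERTv1, ...). Keep the top-1 as the "positive" and sample the rest.
    pids = [ranked_pids[0]] + random.sample(ranked_pids[1:], nway - 1)
    scores = teacher.predict([(query, collection[pid]) for pid in pids]).tolist()
    return [qid] + [[pid, score] for pid, score in zip(pids, scores)]

# with open('examples.json', 'w') as out:
#     for qid, query in train_queries.items():
#         example = make_example(qid, query, search_top_k(query), collection)
#         out.write(json.dumps(example) + '\n')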

okhat avatar Jul 27 '23 08:07 okhat

Thank you sir

liuhuaizheng avatar Jul 27 '23 08:07 liuhuaizheng

Btw, there may be code for cross-encoding in the library too, e.g., for launching this in parallel across multiple GPUs. But anyway, this part is not specific to ColBERT.
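
In case that utility isn't easy to find, here is a generic sketch of sharding the <q, d> pairs across GPUs (again using sentence-transformers as an assumed teacher; nothing here is ColBERT-specific):

import multiprocessing as mp

from sentence_transformers import CrossEncoder

def score_shard(args):
    device_id, pairs = args
    # Load one teacher per GPU and score that shard of (query, passage) pairs.
    teacher = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', device=f'cuda:{device_id}')
    return teacher.predict(pairs, batch_size=128).tolist()

def score_all(pairs, num_gpus=4):
    shards = [(i, pairs[i::num_gpus]) for i in range(num_gpus)]
    with mp.get_context('spawn').Pool(num_gpus) as pool:
        shard_scores = pool.map(score_shard, shards)
    # Re-interleave so scores line up with the original pair order.
    scores = [None] * len(pairs)
    for i, chunk in enumerate(shard_scores):
        scores[i::num_gpus] = chunk
    return scores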

okhat avatar Jul 27 '23 08:07 okhat

Could you provide example arguments for the scripts in the utility/ directory? For utility/triples.py I can see --ranking, --output, --positives, --depth, --permissive, --biased, and --seed; it would be nice to have an example value for each argument.

liuhuaizheng avatar Jul 27 '23 09:07 liuhuaizheng

I'm excited to finally close this issue. Data below.

https://twitter.com/lateinteraction/status/1753428544259346935

okhat avatar Feb 02 '24 14:02 okhat

Basically, I uploaded the examples file (64-way) and initial checkpoint (colbert v1.9) to HF hub.

The instructions are now in the README.

Instructions:

https://github.com/stanford-futuredata/ColBERT?tab=readme-ov-file#advanced-training-colbertv2-style

Examples file:

https://huggingface.co/colbert-ir/colbertv2.0_msmarco_64way/blob/main/examples.json
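
If you'd rather fetch the examples file programmatically, a minimal sketch with huggingface_hub (the repo id and filename come from the link above; the file is ~26GB):

from huggingface_hub import hf_hub_download

# Downloads examples.json into the local HF cache and returns its path;
# pass that path as `triples` to the Trainer in the snippet below.
examples_path = hf_hub_download(
    repo_id='colbert-ir/colbertv2.0_msmarco_64way',
    filename='examples.json',
)
print(examples_path)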

okhat avatar Feb 02 '24 14:02 okhat


from colbert.infra.run import Run
from colbert.infra.config import ColBERTConfig, RunConfig
from colbert import Trainer


def train():
    # use 4 GPUs (e.g., four A100s, but you can use fewer by changing nway, accumsteps, bsize).
    with Run().context(RunConfig(nranks=4)):
        triples = '/path/to/examples.64.json'  # `wget https://huggingface.co/colbert-ir/colbertv2.0_msmarco_64way/resolve/main/examples.json?download=true` (26GB)
        queries = '/path/to/MSMARCO/queries.train.tsv'
        collection = '/path/to/MSMARCO/collection.tsv'

        config = ColBERTConfig(bsize=32, lr=1e-05, warmup=20_000, doc_maxlen=180, dim=128, attend_to_mask_tokens=False, nway=64, accumsteps=1, similarity='cosine', use_ib_negatives=True)
        trainer = Trainer(triples=triples, queries=queries, collection=collection, config=config)

        trainer.train(checkpoint='colbert-ir/colbertv1.9')  # or start from scratch, like `bert-base-uncased`


if __name__ == '__main__':
    train()

okhat avatar Feb 02 '24 14:02 okhat