ColBERT
Troubleshooting encoding performance
I'm trying to do low-level encoding so I can add the vectors to my own index:

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint
from colbert.indexing.collection_encoder import CollectionEncoder

cf = ColBERTConfig(checkpoint='checkpoints/colbertv2.0')
cp = Checkpoint(cf.checkpoint, colbert_config=cf)
encoder = CollectionEncoder(cf, cp)
passages = ...
encoder.encode_passages(passages)
```
This works, but it's slow, and nvidia-smi shows the GPU almost entirely idle (1–5% utilization), even if I spin up multiple threads (each with its own encoder, of course). Is this expected?
I do see

```python
>>> torch.cuda.is_available()
True
```

but that's about the extent of my troubleshooting knowledge.
A few questions:
- Have you tried the PyTorch profiler? https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html I'd probably start there.
- How are you loading the data? It looks like your dataset is loaded from memory, but I want to confirm there isn't an issue with the loading step. PyTorch provides dedicated classes for this:
  - Creating a custom dataset: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files
  - Creating a DataLoader for that dataset: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#preparing-your-data-for-training-with-dataloaders
- What value are you setting for `index_bsize`? You probably want to increase it until it breaks, then back it off. If data transfers are frequently going back and forth between the CPU and GPU, that will bottleneck a lot of GPU processing.
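To expand on the profiler suggestion, here's a minimal sketch. The `encode_stub` function is a stand-in workload (not part of ColBERT); in your case you'd replace it with the `encoder.encode_passages(passages)` call. Sorting the table by CUDA time shows whether the GPU is doing real work or whether time is dominated by CPU-side tokenization and transfers.

```python
import torch
from torch.profiler import profile, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

def encode_stub():
    # Stand-in workload; replace with encoder.encode_passages(passages).
    x = torch.randn(256, 128)
    return x @ x.T

with profile(activities=activities, record_shapes=True) as prof:
    encode_stub()

# When CUDA is available, sort by "cuda_time_total" instead to see GPU kernels.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

If the top rows are all CPU-side ops with negligible CUDA time, the bottleneck is before the model ever runs.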
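On the loading point, a minimal sketch of wrapping an in-memory passage list in a `Dataset`/`DataLoader` pair (the class name and sample passages are illustrative, not part of ColBERT's API):

```python
from torch.utils.data import Dataset, DataLoader

class PassageDataset(Dataset):
    """Illustrative wrapper around an in-memory list of passage strings."""
    def __init__(self, passages):
        self.passages = passages

    def __len__(self):
        return len(self.passages)

    def __getitem__(self, idx):
        return self.passages[idx]

passages = ["first passage", "second passage", "third passage"]
# num_workers > 0 moves loading off the main thread for on-disk datasets;
# for a small in-memory list, 0 is fine.
loader = DataLoader(PassageDataset(passages), batch_size=2, num_workers=0)
batches = [batch for batch in loader]
print(batches)  # [['first passage', 'second passage'], ['third passage']]
```

Even when the data is already in memory, batching it this way makes it easier to spot whether the loading step or the encoding step is the slow part.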