ColBERT
Troubleshooting encoding performance
I'm trying to do low-level encoding so I can add the vectors to my own index:

```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint
from colbert.indexing.collection_encoder import CollectionEncoder

cf = ColBERTConfig(checkpoint='checkpoints/colbertv2.0')
cp = Checkpoint(cf.checkpoint, colbert_config=cf)
encoder = CollectionEncoder(cf, cp)
passages = ...
encoder.encode_passages(passages)
```
This works, but it's slow, and nvidia-smi shows the GPU almost entirely idle (1–5% utilization), even if I spin up multiple threads (each with its own encoder, of course). Is this expected?
I do see

```python
>>> torch.cuda.is_available()
True
```

but that's about the extent of my troubleshooting knowledge.
A few questions:
- Have you tried the PyTorch profiler? https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html I'd probably start there.
- How are you loading the data? It looks like your dataset is loaded from memory, but I want to confirm there isn't an issue with the loading step. PyTorch provides dedicated classes for this:
  - Creating a custom dataset: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files
  - Creating a DataLoader for that dataset: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#preparing-your-data-for-training-with-dataloaders
- What value are you setting for `index_bsize`? You probably want to increase it until it breaks, then back it off. If data transfers are frequently going back and forth between the CPU and GPU, that will bottleneck a lot of GPU processing.
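To expand on the profiler suggestion, here's a minimal sketch. The `encode_stub` function is a stand-in workload (not part of ColBERT); in your case you'd replace it with the `encoder.encode_passages(passages)` call. Sorting the table by CUDA time shows whether the GPU is doing real work or whether time is dominated by CPU-side tokenization and transfers.

```python
import torch
from torch.profiler import profile, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

def encode_stub():
    # Stand-in workload; replace with encoder.encode_passages(passages).
    x = torch.randn(256, 128)
    return x @ x.T

with profile(activities=activities, record_shapes=True) as prof:
    encode_stub()

# When CUDA is available, sort by "cuda_time_total" instead to see GPU kernels.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

If the top rows are all CPU-side ops with negligible CUDA time, the bottleneck is before the model ever runs.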
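On the loading point, a minimal sketch of wrapping an in-memory passage list in a `Dataset`/`DataLoader` pair (the class name and sample passages are illustrative, not part of ColBERT's API):

```python
from torch.utils.data import Dataset, DataLoader

class PassageDataset(Dataset):
    """Illustrative wrapper around an in-memory list of passage strings."""
    def __init__(self, passages):
        self.passages = passages

    def __len__(self):
        return len(self.passages)

    def __getitem__(self, idx):
        return self.passages[idx]

passages = ["first passage", "second passage", "third passage"]
# num_workers > 0 moves loading off the main thread for on-disk datasets;
# for a small in-memory list, 0 is fine.
loader = DataLoader(PassageDataset(passages), batch_size=2, num_workers=0)
batches = [batch for batch in loader]
print(batches)  # [['first passage', 'second passage'], ['third passage']]
```

Even when the data is already in memory, batching it this way makes it easier to spot whether the loading step or the encoding step is the slow part.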