Propose chunked computation for the `RerankingEvaluator`
The MindSmallReranking dataset contains 2,362,514 queries, 107,968 positive docs, and 2,550,123 negative docs.
Currently, RerankingEvaluator.compute_metrics_batched() just gathers all texts together and encodes them, which requires a lot of memory / GPU memory. (I got a CUDA OOM on a 32 GB V100.)
I made minor modifications to the code to implement chunked computation, reducing memory usage.
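Roughly, the change looks like this (an illustrative sketch rather than the exact patch; the chunk_size default and the score_fn callback below are placeholders for the existing per-sample metric computation in RerankingEvaluator):

import torch

def compute_metrics_in_chunks(samples, model, score_fn, batch_size=128, chunk_size=50000):
    # Evaluate `samples` chunk by chunk so that only one chunk's embeddings
    # live on the GPU at a time. Each sample is assumed to be a dict with
    # "query", "positive" and "negative" keys, as in RerankingEvaluator.
    results = []
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        # encode only this chunk's queries and docs
        query_embs = model.encode([s["query"] for s in chunk],
                                  convert_to_tensor=True, batch_size=batch_size)
        all_docs = [d for s in chunk for d in s["positive"] + s["negative"]]
        docs_embs = model.encode(all_docs, convert_to_tensor=True, batch_size=batch_size)
        # slice out each sample's docs and delegate scoring to the existing metric code
        offset = 0
        for i, sample in enumerate(chunk):
            n_docs = len(sample["positive"]) + len(sample["negative"])
            is_relevant = [True] * len(sample["positive"]) + [False] * len(sample["negative"])
            results.append(score_fn(query_embs[i], docs_embs[offset:offset + n_docs], is_relevant))
            offset += n_docs
        # free this chunk's embeddings before encoding the next one
        del query_embs, docs_embs
        torch.cuda.empty_cache()
    return results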
If this change is acceptable, I would be glad to make a PR. Thanks.
Hmm, usually the batch_size kwarg is supposed to solve that issue, i.e. in the model's encode function the batch_size kwarg is used to make sure everything fits into GPU memory - is it not working for you?
Hmm, yeah, it may be about the inference batch_size.
I realize I did the math wrong before, sorry.
In RerankingEvaluator.compute_metrics_batched() (line 84 & 94), SentenceTransformer.encode is called with convert_to_tensor=True.
My embedding dim is 1024 and this dataset has 5,020,605 texts, so 5020605 * 1024 * 4 / 1024 / 1024 / 1024 ≈ 19.15 GiB of GPU memory is needed just to hold the embeddings.
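In code form (assuming fp32 embeddings; the variable names are just for illustration):

num_texts = 2362514 + 107968 + 2550123   # queries + positive docs + negative docs = 5,020,605
embedding_dim = 1024
bytes_per_fp32 = 4
print(num_texts * embedding_dim * bytes_per_fp32 / 1024 ** 3)  # ~19.15 GiB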
It's possible that using the large batch size caused PyTorch to reserve too much GPU memory, resulting in the OOM.
I will check it. Thank you.
Hi, I think the chunking is still needed.
Recall that all docs are gathered and encoded with SentenceTransformer.encode at once, where torch.stack(all_embeddings) is called to merge the batched embeddings.
Since stack is not an in-place operation, PyTorch copies all the tensors, so the peak memory is roughly doubled.
Then boom.
Batches: 100%|█████████▉| 20764/20767 [53:55<00:00, 22.66it/s]
Batches: 100%|██████████| 20767/20767 [53:55<00:00, 21.83it/s]
Batches: 100%|██████████| 20767/20767 [53:55<00:00, 6.42it/s]
2023-04-24 12:38:30,170 - ERROR - mteb.evaluation.MTEB : Error while evaluating MindSmallReranking: CUDA out of memory. Tried to allocate 10.14 GiB (GPU 0; 31.75 GiB total capacity; 21.24 GiB already allocated; 9.38 GiB free; 21.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-04-24 12:38:30,170 - ERROR - mteb.evaluation.MTEB : Please check all the error logs at: error_logs.txt
2023-04-24 12:38:30.190400 >>> MindSmallReranking
Traceback (most recent call last):
File "/nas/emb/mteb/mteb/evaluation/MTEB.py", line 244, in run
results = task.evaluate(model, split, **kwargs)
File "/nas/emb/mteb/mteb/abstasks/AbsTaskReranking.py", line 24, in evaluate
scores = evaluator(model)
File "/nas/emb/mteb/mteb/evaluation/evaluators/RerankingEvaluator.py", line 56, in __call__
scores = self.compute_metrics(model)
File "/nas/emb/mteb/mteb/evaluation/evaluators/RerankingEvaluator.py", line 61, in compute_metrics
self.compute_metrics_batched(model)
File "/nas/emb/mteb/mteb/evaluation/evaluators/RerankingEvaluator.py", line 94, in compute_metrics_batched
all_docs_embs = model.encode(all_docs, convert_to_tensor=True, batch_size=self.batch_size)
File "run_mteb.py", line 79, in encode
return self.model.encode(
File "/mnt/nas/emb/src/models.py", line 152, in encode
return super().encode(sentences, **kwargs)
File "/nas/envs/mteb/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 195, in encode
all_embeddings = torch.stack(all_embeddings)
RuntimeError: CUDA out of memory. Tried to allocate 10.14 GiB (GPU 0; 31.75 GiB total capacity; 21.24 GiB already allocated; 9.38 GiB free; 21.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
And I wrote a simulation script that quickly reproduces the case.
import subprocess
import tqdm
import torch

command = 'nvidia-smi'
embedding_dim = 1024
batch_size = 128

print('model and queries')
# fake model parameters (~2.2 GB) and all query embeddings, kept on the GPU
fake_model_param = torch.zeros(560, 1000000, requires_grad=False).cuda()
queries = torch.zeros(2362514, embedding_dim, requires_grad=False).cuda()
subprocess.call(command)  # 12466MiB / 32510MiB

print('docs')
# simulate the per-batch doc embeddings that SentenceTransformer.encode accumulates
tensors = list()
num_docs = 2550123 + 107968
for i in tqdm.tqdm(range(0, num_docs, batch_size)):
    tensors.append(torch.zeros(batch_size, embedding_dim, requires_grad=False).cuda())
subprocess.call(command)  # 22850MiB / 32510MiB
# 22850 - 12466 = 10384 MiB, these are the docs

print('stacking docs')
# torch.stack allocates a new tensor and copies, i.e. a second full copy of the docs
embedding = torch.stack(tensors)
subprocess.call(command)  # would need 22850 + 10384 = 33234 MiB > 32510 MiB -> OOM
Good point - I think that's a problem with SentenceTransformers though - they should move the tensors to CPU before doing the stacking operation.
Since we can't modify their source code, we can drop convert_to_tensor (i.e. let the encode function convert to numpy) and then just convert the numpy array back to a torch tensor, since we need the torch tensor format for the scoring. E.g. we do torch.tensor(model.encode(all_docs, convert_to_tensor=False, batch_size=self.batch_size)).
What do you think about this solution?
This is indeed a problem with sentence-transformers, as they did not consider calls at the million-text scale. I agree that your method is a reasonable solution.
I still think the large memory / GPU memory requirement for storing millions of embeddings could be avoided altogether, e.g. with the chunking in BEIR's exact_search (sketched below).
Emmm, could I ask why you are against using chunking? After all, it's almost free and doesn't impact anything; it just splits self.samples into multiple chunked_samples.
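For illustration, a BEIR-style chunked encoding could look like this (a sketch only; corpus_chunk_size and the helper name are made up, and the merged embeddings end up in CPU RAM rather than GPU memory):

import torch

def encode_docs_chunked(model, all_docs, batch_size=128, corpus_chunk_size=100000):
    # encode the docs chunk by chunk; only one chunk's embeddings sit on the
    # GPU at a time, and the merged result is kept in CPU RAM
    chunks = []
    for start in range(0, len(all_docs), corpus_chunk_size):
        chunk_embs = model.encode(all_docs[start:start + corpus_chunk_size],
                                  convert_to_tensor=True, batch_size=batch_size)
        chunks.append(chunk_embs.cpu())
    return torch.cat(chunks)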
I think that solution is fine, too - but I'm not sure how you would pass in the chunk_size kwarg? It's not used anywhere in MTEB thus far, afaict.
The nice thing about removing the convert_to_tensor stuff is that models which are not SentenceTransformers don't have this kwarg anyway, and probably don't have the stacking problem either, so it would make the evaluator more compatible with those.
I believe this issue is stale. Will close it.