
Propose chunked computation for the `RerankingEvaluator`

Open izhx opened this issue 2 years ago • 7 comments

The MindSmallReranking dataset contains 2,362,514 queries, 107,968 positive docs, and 2,550,123 negative docs.

Currently, RerankingEvaluator.compute_metrics_batched() just gathers all texts together and encodes them at once, which requires a lot of memory / GPU memory. (I got CUDA OOM on a 32GB V100.)

I made minor modifications to the code to implement chunked computation, reducing memory usage.

If this change is acceptable, I would be glad to make a PR. Thanks.

izhx avatar Apr 23 '23 11:04 izhx

Hmm usually the batch_size kwarg is supposed to solve that issue, i.e. in the model's encode function the batch_size kwarg is used to make sure it fits into GPU memory - Is it not working for you?

Muennighoff avatar Apr 23 '23 15:04 Muennighoff

Hmm, yeah, it may be about the inference batch_size. I realize I did the math wrong, sorry. In RerankingEvaluator.compute_metrics_batched() (lines 84 & 94), SentenceTransformer.encode is called with convert_to_tensor=True. My embedding dim is 1024 and this dataset has 5,020,605 texts, so 5020605 * 1024 * 4 / 1024 / 1024 / 1024 ≈ 19.15 GB of GPU memory is needed just for the embeddings. It's possible that using a large batch size caused PyTorch to reserve too much GPU memory, resulting in the OOM.
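As a quick sanity check, the estimate works out as follows (a minimal sketch using only the numbers above):

num_texts = 2_362_514 + 107_968 + 2_550_123  # queries + positive docs + negative docs
embedding_dim = 1024
bytes_per_value = 4  # fp32
gb = num_texts * embedding_dim * bytes_per_value / 1024 ** 3
print(f'{gb:.2f} GB for all embeddings')  # ~19.15 GB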

I will check it. Thank you.

izhx avatar Apr 23 '23 15:04 izhx

Hi, I think the chunking is still needed. Recall that all docs are gathered and encoded with SentenceTransformer.encode at once, and torch.stack(all_embeddings) is called internally to merge the batched embeddings. Since stack is not an in-place operation, PyTorch copies all the tensors. Then boom.

Batches: 100%|█████████▉| 20764/20767 [53:55<00:00, 22.66it/s]
Batches: 100%|██████████| 20767/20767 [53:55<00:00, 21.83it/s]
Batches: 100%|██████████| 20767/20767 [53:55<00:00,  6.42it/s]
2023-04-24 12:38:30,170 - ERROR - mteb.evaluation.MTEB : Error while evaluating MindSmallReranking: CUDA out of memory. Tried to allocate 10.14 GiB (GPU 0; 31.75 GiB total capacity; 21.24 GiB already allocated; 9.38 GiB free; 21.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2023-04-24 12:38:30,170 - ERROR - mteb.evaluation.MTEB : Please check all the error logs at: error_logs.txt
2023-04-24 12:38:30.190400 >>> MindSmallReranking
Traceback (most recent call last):
  File "/nas/emb/mteb/mteb/evaluation/MTEB.py", line 244, in run
    results = task.evaluate(model, split, **kwargs)
  File "/nas/emb/mteb/mteb/abstasks/AbsTaskReranking.py", line 24, in evaluate
    scores = evaluator(model)
  File "/nas/emb/mteb/mteb/evaluation/evaluators/RerankingEvaluator.py", line 56, in __call__
    scores = self.compute_metrics(model)
  File "/nas/emb/mteb/mteb/evaluation/evaluators/RerankingEvaluator.py", line 61, in compute_metrics
    self.compute_metrics_batched(model)
  File "/nas/emb/mteb/mteb/evaluation/evaluators/RerankingEvaluator.py", line 94, in compute_metrics_batched
    all_docs_embs = model.encode(all_docs, convert_to_tensor=True, batch_size=self.batch_size)
  File "run_mteb.py", line 79, in encode
    return self.model.encode(
  File "/mnt/nas/emb/src/models.py", line 152, in encode
    return super().encode(sentences, **kwargs)
  File "/nas/envs/mteb/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py", line 195, in encode
    all_embeddings = torch.stack(all_embeddings)
RuntimeError: CUDA out of memory. Tried to allocate 10.14 GiB (GPU 0; 31.75 GiB total capacity; 21.24 GiB already allocated; 9.38 GiB free; 21.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

izhx avatar Apr 24 '23 07:04 izhx

And I wrote a simulation script that can quickly reproduce the case.

import subprocess

import torch
import tqdm

command = 'nvidia-smi'
embedding_dim = 1024
batch_size = 128

print('model and queries')
# Placeholders roughly the size of the model weights and all query embeddings on the GPU.
fake_model_param = torch.zeros(560, 1000000, requires_grad=False).cuda()
queries = torch.zeros(2362514, embedding_dim, requires_grad=False).cuda()

subprocess.call(command)  # 12466MiB / 32510MiB

print('docs')
# Batched doc embeddings, as they accumulate inside SentenceTransformer.encode.
tensors = list()
num_docs = 2550123 + 107968
for i in tqdm.tqdm(range(0, num_docs, batch_size)):
    tensors.append(torch.zeros(batch_size, embedding_dim, requires_grad=False).cuda())

subprocess.call(command)  # 22850MiB / 32510MiB
# 22850 - 12466 = 10384  these are the docs

print('stacking docs')
# stack is not in-place: it allocates a new tensor and copies every batch,
# so the doc embeddings briefly exist twice on the GPU.
embedding = torch.stack(tensors)

subprocess.call(command)  # would need 22850 + 10384 = 33234 > 32510 -> OOM

izhx avatar Apr 24 '23 07:04 izhx

Good point - I think that's a problem with SentenceTransformers though - they should move the tensors to the CPU before doing the stacking operation.

Since we can't modify their source code, we can remove the convert_to_tensor kwarg (i.e. let the encode function convert to numpy) and then just convert the numpy array back to a torch tensor, since we need the torch tensor format for the scoring. E.g. we do torch.tensor(model.encode(all_docs, convert_to_tensor=False, batch_size=self.batch_size)).
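For illustration, a minimal sketch of that idea as a standalone helper (the function name encode_as_tensor is made up here, not part of MTEB or sentence-transformers):

import numpy as np
import torch


def encode_as_tensor(model, texts, batch_size=128):
    # Let SentenceTransformer return numpy output, so its internal stacking
    # happens on the CPU; then wrap the result in a torch tensor for scoring.
    embeddings = model.encode(texts, convert_to_tensor=False, batch_size=batch_size)
    return torch.from_numpy(np.asarray(embeddings))

This keeps the full embedding matrix in CPU RAM instead of GPU memory, at the cost of the scoring then starting from CPU tensors.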

What do you think about this solution?

Muennighoff avatar Apr 24 '23 19:04 Muennighoff

This is indeed a problem with sentence-transformers, as they did not consider calls at the scale of millions of texts. I agree that your method is a reasonable solution.

I also think the large memory or GPU memory requirement for storing millions of embeddings could be avoided, e.g. with the chunking used in BEIR's exact_search.

Emmm, could I ask why you are against using chunking? After all, it's almost free and doesn't impact anything. It just splits self.samples into multiple chunked_samples.
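For concreteness, the proposal could look roughly like the following wrapper (a sketch only: compute_metrics_in_chunks and the chunk_size default are hypothetical, and the per-chunk results would still have to be merged into the final metrics in whatever way matches the current implementation):

def compute_metrics_in_chunks(evaluator, model, chunk_size=100000):
    # Hypothetical wrapper: split evaluator.samples into chunks and run the
    # existing batched computation on each chunk, so the texts of the whole
    # dataset never have to be encoded and held in memory at the same time.
    all_samples = evaluator.samples
    per_chunk_scores = []
    try:
        for start in range(0, len(all_samples), chunk_size):
            evaluator.samples = all_samples[start:start + chunk_size]
            per_chunk_scores.append(evaluator.compute_metrics_batched(model))
    finally:
        evaluator.samples = all_samples  # restore the full sample list
    return per_chunk_scores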

izhx avatar Apr 25 '23 02:04 izhx

I think that solution is fine, too - but I'm not sure how you would pass in the chunk_size kwarg? It's not used anywhere in MTEB thus far, afaict.

The nice thing about removing the convert_to_tensor stuff is that models which are not SentenceTransformers do not have this kwarg anyway, and probably do not have the stacking problem either, so removing it would make MTEB more compatible with them.

Muennighoff avatar Apr 25 '23 05:04 Muennighoff

I believe this issue is stale. Will close it

KennethEnevoldsen avatar Jun 05 '24 18:06 KennethEnevoldsen