FlagEmbedding
BGE-M3 Sparse
Currently I cannot really use the "sparse mode" of BGE-M3. Even with 8 GB of VRAM and small batch sizes I get a CUDA out-of-memory error. Why does this mode need so much VRAM? Is this to be expected? The other modes (Dense/ColBERT) don't run into this, even with large batches. Can this somehow be mitigated, e.g. by splitting the work over multiple GPUs?
File "/usr/lib/python3/dist-packages/FlagEmbedding/BGE_M3/modeling.py", line 357, in forward sparse_vecs = self.sparse_embedding(last_hidden_state, text_input['input_ids'], File "/usr/lib/python3/dist-packages/FlagEmbedding/BGE_M3/modeling.py", line 106, in sparse_embedding sparse_embedding = torch.zeros(input_ids.size(0), input_ids.size(1), self.vocab_size, torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.66 GiB.
Hello! Could you paste your code here? I will check it.
Hi, I just call your methods without much fluff:
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel(
    "BAAI/bge-m3", use_fp16=True
)

# passages_inputs is a tokenized batch (input_ids / attention_mask) of my passages
passages_outputs = model.model(
    passages_inputs,
    return_dense=False,
    return_sparse=True,
    return_colbert=False,
    return_sparse_embedding=True
)
I am just following compute_score here: https://github.com/FlagOpen/FlagEmbedding/blob/11dc092e39ed0ff6e715866b2bdaca0cc775a296/FlagEmbedding/bge_m3.py#L188, which also uses sparse_vecs and not lexical_weights.
Nothing special. But I think I understand the problem. See the sparse_embedding method here: https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/modeling.py
sparse_embedding = torch.zeros(input_ids.size(0), input_ids.size(1), self.vocab_size,
dtype=token_weights.dtype,
device=token_weights.device)
Because the vocab size is quite large at roughly 250,000, this method tries to allocate 250,000 * 4 bytes * 512 tokens ≈ 0.5 GB per sequence (assuming a sequence as short as 512 tokens, which is what I have). So a batch with 10 entries of short 512-token sequences already needs about 5 GB! That doesn't scale well... and I won't even talk about 8k sequences here.
Scattering these individual token weights over such a huge sparse tensor just for a max-pooling operation doesn't sound efficient to me - the memory I/O pressure on VRAM is intense, even when the memory is available - but I don't have the time to deep dive into EmbeddingBag etc.
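Just to illustrate what I mean, a rough, untested sketch (not the library's code; I'm assuming token_weights has shape (batch, seq_len, 1) as produced by the sparse head): max-pool the weights directly into a (batch, vocab_size) tensor with scatter_reduce, so the (batch, seq_len, vocab_size) intermediate is never allocated:
import torch

def sparse_embedding_lowmem(token_weights, input_ids, vocab_size):
    # token_weights: (batch, seq_len, 1), relu'd outputs of the sparse head (assumed shape)
    # input_ids:     (batch, seq_len)
    # Allocates only (batch, vocab_size) instead of (batch, seq_len, vocab_size).
    pooled = torch.zeros(input_ids.size(0), vocab_size,
                         dtype=token_weights.dtype,
                         device=token_weights.device)
    # max-pool each token's weight into its vocabulary slot;
    # zeros + include_self=True is fine because the weights are non-negative
    pooled.scatter_reduce_(dim=-1, index=input_ids,
                           src=token_weights.squeeze(-1),
                           reduce="amax", include_self=True)
    # special tokens (CLS/EOS/etc.) would still need to be zeroed out,
    # as the original implementation does
    return pooled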
We recommend using the encode function, which returns a dict of token weights instead of a vocab-size sparse embedding. You can refer to https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/BGE_M3#generate-embedding-for-text
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
sentences_1 = ["What is BGE M3?", "Defination of BM25"]
sentences_2 = ["BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
"BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)
# you can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
# [{'What': 0.08356, 'is': 0.0814, 'B': 0.1296, 'GE': 0.252, 'M': 0.1702, '3': 0.2695, '?': 0.04092},
# {'De': 0.05005, 'fin': 0.1368, 'ation': 0.04498, 'of': 0.0633, 'BM': 0.2515, '25': 0.3335}]
# compute the scores via lexical matching
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.19554901123046875
print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
# 0.0
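Applied to your passages, something like the following avoids the vocab-size tensor entirely; batch_size and max_length are the knobs for peak memory (a sketch only, assuming the current encode signature and a list of raw passage strings called passages):
passages_output = model.encode(
    passages,                 # list of raw passage strings (assumed variable)
    batch_size=4,             # smaller batches reduce peak VRAM
    max_length=512,           # cap the sequence length if your passages are short
    return_dense=False,
    return_sparse=True,
    return_colbert_vecs=False
)
# one {token_id: weight} dict per passage; no (batch, seq_len, vocab_size) tensor is built
passage_weights = passages_output['lexical_weights']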
Thank you all, I will try.
In that case, shouldn't you adapt your example at https://huggingface.co/BAAI/bge-m3 ("Compute score for text pairs"), which uses the method model.compute_score(), which in turn uses sparse embeddings?
Is it really the same quality in the end? I don't fully understand the difference right now: when should I use the token weights and when the sparse vecs? Sparse vecs for training and the token dict for inference?
There is no difference between the results of sparse-vecs and token-weights. sparse-vecs is suitable for training on GPUs because it can be used as a tensor.
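A toy sketch of the equivalence (not library code; it assumes compute_lexical_matching_score sums the products of weights for token ids present in both dicts):
import torch

vocab_size = 10  # toy vocabulary

# token-weight dicts as returned by encode(..., return_sparse=True)
weights_a = {3: 0.5, 7: 0.2}
weights_b = {3: 0.4, 5: 0.9}

# dict form: sum of products over shared token ids
dict_score = sum(w * weights_b[t] for t, w in weights_a.items() if t in weights_b)

# tensor form: scatter each dict into a vocab-size vector, then take the dot product
vec_a = torch.zeros(vocab_size)
vec_b = torch.zeros(vocab_size)
for t, w in weights_a.items():
    vec_a[t] = w
for t, w in weights_b.items():
    vec_b[t] = w
tensor_score = torch.dot(vec_a, vec_b).item()

print(dict_score, tensor_score)  # both 0.2 (= 0.5 * 0.4); only the shared token contributes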