
GPU usage problem with BGEM3FlagModel

Open Anthony-Sun-S opened this issue 1 year ago • 1 comments

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,5"
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel('BAAI_bge-m3', use_fp16=True)

When I call the GPUs with the code above, I noticed that with two GPUs the program runs normally, but the memory usage is extremely uneven: the first card can exceed 40 GB while the second card uses only about 4 GB. As soon as I configure more than two GPUs, the program fails with this error:

Traceback (most recent call last):
  File "inference_m3.py", line 44, in <module>
    score = model.compute_score(batch, max_passage_length=1024, weights_for_different_modes=[0.4, 0.2, 0.4])
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/FlagEmbedding/bge_m3.py", line 235, in compute_score
    queries_output = self.model(queries_inputs, return_dense=True, return_sparse=True, return_colbert=True,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/FlagEmbedding/BGE_M3/modeling.py", line 350, in forward
    last_hidden_state = self.model(**text_input, return_dict=True).last_hidden_state
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 184, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/data_parallel.py", line 189, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/replicate.py", line 110, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/replicate.py", line 79, in _broadcast_coalesced_reshape
    return comm.broadcast_coalesced(tensors, devices)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/comm.py", line 57, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)

How can this be solved? Is there a correct way to use FlagEmbedding with multiple GPUs?
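For reference, a minimal sketch of acting on the hint in the traceback, i.e. turning NCCL's own logging on before the model is created (the device selection is copied from the snippet above; nothing else is assumed):

```python
# Minimal sketch: enable NCCL's diagnostic output, as the error message suggests,
# before any CUDA/NCCL work happens, then reproduce the failing call.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,5"   # same device selection as above
os.environ["NCCL_DEBUG"] = "INFO"            # make "unhandled system error" verbose

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI_bge-m3', use_fp16=True)
```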

Anthony-Sun-S avatar Feb 05 '24 07:02 Anthony-Sun-S

Hi, compute_score uses DataParallel, so the outputs from all GPUs are gathered back onto the first card. Since the ColBERT vectors and sparse vectors take up a lot of space, the first card ends up with very high memory usage, especially when the inputs are long. The current compute_score implementation is only an example and still needs to be optimized for real use. If you need compute_score for reranking, you can try bge-reranker instead.
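For completeness, a short sketch of the bge-reranker route mentioned above (the checkpoint name and example texts are placeholders, not something from this thread):

```python
# Sketch of reranking with bge-reranker instead of BGEM3FlagModel.compute_score.
# The checkpoint name and texts below are only examples.
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

# A single [query, passage] pair ...
score = reranker.compute_score(['what is BGE-M3?', 'BGE-M3 is a multilingual embedding model.'])

# ... or a list of pairs scored in one call.
scores = reranker.compute_score([
    ['what is BGE-M3?', 'BGE-M3 is a multilingual embedding model.'],
    ['what is BGE-M3?', 'Paris is the capital of France.'],
])
print(score, scores)
```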

staoxiao avatar Feb 05 '24 14:02 staoxiao

I'm still evaluating the retrieval stage and haven't reached reranking yet. With sentence-transformers, m3 performs much better than v1.5-large, and switching to FlagEmbedding's compute_score improves the metrics further. I did try using the reranker directly before, but there is still a noticeable gap.

Anthony-Sun-S avatar Feb 06 '24 02:02 Anthony-Sun-S
Thanks for the feedback! compute_score mainly relies on the ColBERT computation; we will consider compressing it later to reduce memory usage.
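Until then, one possible workaround is to combine the scores yourself from encode() outputs, which keeps everything on a single card. A rough sketch (the model path, example texts, and the dense/sparse/ColBERT weight order are assumptions; the 0.4/0.2/0.4 weights are taken from the original call):

```python
# Rough sketch: combine dense, sparse and ColBERT similarities manually instead of
# calling compute_score, so no DataParallel gather onto the first GPU is needed.
# Model path, example texts and the weight order are assumptions for illustration.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

queries = ['what is BGE-M3?']
passages = ['BGE-M3 is a multilingual embedding model.']

q = model.encode(queries, return_dense=True, return_sparse=True, return_colbert_vecs=True)
p = model.encode(passages, return_dense=True, return_sparse=True, return_colbert_vecs=True)

dense = float(q['dense_vecs'][0] @ p['dense_vecs'][0])
sparse = model.compute_lexical_matching_score(q['lexical_weights'][0], p['lexical_weights'][0])
colbert = float(model.colbert_score(q['colbert_vecs'][0], p['colbert_vecs'][0]))

# 0.4 / 0.2 / 0.4 as in the original compute_score call
final = 0.4 * dense + 0.2 * sparse + 0.4 * colbert
print(dense, sparse, colbert, final)
```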

staoxiao avatar Feb 06 '24 07:02 staoxiao

OK, looking forward to the update. Thanks for your work!

Anthony-Sun-S avatar Feb 06 '24 09:02 Anthony-Sun-S