FlagEmbedding

When the embedding model is deployed directly behind an API, model.encode(query_list) raises NCCL Error 3 as soon as the request volume grows slightly

Open · zouchl opened this issue 2 years ago · 1 comment

The problem occurs on both CUDA 11.8 and 12.2. The failing call is embeddings = model.encode(queries), which produces the traceback below:

  File "/usr/local/python3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/python3/lib/python3.10/site-packages/FlagEmbedding/flag_models.py", line 90, in encode
    last_hidden_state = self.model(**inputs, return_dict=True).last_hidden_state
  File "/usr/local/python3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/python3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/python3/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 184, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/usr/local/python3/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 189, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/usr/local/python3/lib/python3.10/site-packages/torch/nn/parallel/replicate.py", line 110, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/usr/local/python3/lib/python3.10/site-packages/torch/nn/parallel/replicate.py", line 79, in _broadcast_coalesced_reshape
    return comm.broadcast_coalesced(tensors, devices)
  File "/usr/local/python3/lib/python3.10/site-packages/torch/nn/parallel/comm.py", line 57, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 3: internal error - please report this issue to the NCCL developers

Could anyone advise what typically causes this?

zouchl · Dec 01 '23 09:12
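The traceback shows the failure inside torch.nn.DataParallel's replicate step, which this FlagEmbedding release enters whenever more than one GPU is visible. A common workaround is to pin the process to a single GPU so the NCCL broadcast is never reached. A minimal sketch, assuming the FlagModel API visible in the traceback; the model name is illustrative:

```python
import os

# Hide all but one GPU *before* torch initializes CUDA, so the model is
# never wrapped in torch.nn.DataParallel and the failing NCCL broadcast
# in replicate() is never executed.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from FlagEmbedding import FlagModel

# Illustrative checkpoint; substitute the model you actually deploy.
model = FlagModel("BAAI/bge-large-zh-v1.5", use_fp16=True)

queries = ["what causes NCCL Error 3?", "how to deploy an embedding API?"]
embeddings = model.encode(queries)
print(embeddings.shape)
```

If the service runs under a prefork server such as gunicorn, loading the model inside each worker (after the fork) rather than in the master process is also worth trying, since CUDA contexts generally do not survive fork().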

Hello, I am not familiar with gunicorn. You could try switching to the sentence-transformers library instead.

staoxiao · Dec 04 '23 02:12
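A minimal sketch of that suggestion, assuming sentence-transformers is installed; the model name is illustrative. Passing an explicit single device keeps encoding on one GPU and sidesteps the DataParallel replication path entirely:

```python
from sentence_transformers import SentenceTransformer

# Illustrative checkpoint; use the model you were serving.
# An explicit single device avoids any multi-GPU replication.
model = SentenceTransformer("BAAI/bge-large-zh-v1.5", device="cuda:0")

queries = ["what causes NCCL Error 3?", "how to deploy an embedding API?"]
embeddings = model.encode(queries, normalize_embeddings=True)
print(embeddings.shape)
```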