[Bug]: PDF parsing failed with NCCL error using GPUs
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly ( 非英文标题的提交将会被直接关闭 ) (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
RAGFlow workspace code commit ID
d447392
RAGFlow image version
d447392(v0.17.1)
Other environment information
Actual behavior
PDF parsing failed with an NCCL error when using GPUs; the error traceback is shown below:
```
Traceback (most recent call last):
  File "/ragflow/rag/svr/task_executor.py", line 584, in handle_task
    await do_handle_task(task)
  File "/ragflow/rag/svr/task_executor.py", line 530, in do_handle_task
    token_count, vector_size = await embedding(chunks, embedding_model, task_parser_config, progress_callback)
  File "/ragflow/rag/svr/task_executor.py", line 392, in embedding
    vts, c = await trio.to_thread.run_sync(lambda: mdl.encode(cnts[i: i + batch_size]))
  File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 447, in to_thread_run_sync
    return msg_from_thread.unwrap()
  File "/ragflow/.venv/lib/python3.10/site-packages/outcome/_impl.py", line 213, in unwrap
    raise captured_error
  File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 373, in do_release_then_return_result
    return result.unwrap()
  File "/ragflow/.venv/lib/python3.10/site-packages/outcome/_impl.py", line 213, in unwrap
    raise captured_error
  File "/ragflow/.venv/lib/python3.10/site-packages/trio/_threads.py", line 392, in worker_fn
    ret = context.run(sync_fn, *args)
  File "/ragflow/rag/svr/task_executor.py", line 392, in <lambda>
    vts, c = await trio.to_thread.run_sync(lambda: mdl.encode(cnts[i: i + batch_size]))
  File "<@beartype(api.db.services.llm_service.LLMBundle.encode) at 0x7dd0d97d6710>", line 31, in encode
  File "/ragflow/api/db/services/llm_service.py", line 240, in encode
    embeddings, used_tokens = self.mdl.encode(texts)
  File "<@beartype(rag.llm.embedding_model.DefaultEmbedding.encode) at 0x7dd0dd92ac20>", line 31, in encode
  File "/ragflow/rag/llm/embedding_model.py", line 104, in encode
    ress.extend(self._model.encode(texts[i:i + batch_size]).tolist())
  File "/ragflow/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/FlagEmbedding/flag_models.py", line 96, in encode
    last_hidden_state = self.model(**inputs, return_dict=True).last_hidden_state
  File "/ragflow/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 192, in forward
    replicas = self.replicate(self.module, self.device_ids[: len(inputs)])
  File "/ragflow/.venv/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 199, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/ragflow/.venv/lib/python3.10/site-packages/torch/nn/parallel/replicate.py", line 134, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/ragflow/.venv/lib/python3.10/site-packages/torch/nn/parallel/replicate.py", line 99, in _broadcast_coalesced_reshape
    return comm.broadcast_coalesced(tensors, devices)
  File "/ragflow/.venv/lib/python3.10/site-packages/torch/nn/parallel/comm.py", line 67, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error (run with NCCL_DEBUG=INFO for details)
```
Expected behavior
No response
Steps to reproduce
Deploy with Docker using the command `RAGFLOW_IMAGE=infiniflow/ragflow:v0.17.1 docker compose -f docker-compose-gpu.yml up -d`, then upload a PDF file in the web client and parse it.
Additional information
No response
FYI.
It seems like this isn't working on our machine. Maybe it's because the GPUs are outdated? I'm not sure.
What type of GPU are you using?
I also encountered this problem, but I added environment variables during the build process, and I also added them on the host machine.
Then I parsed a file and found a new bug.
The log shows "Chunking done" followed by the warning "This could be a false alarm, with some parameters getting used by language bindings but
then being mistakenly passed down to XGBoost core, or some parameter actually being used
but getting flagged wrongly here. Please open an issue if you find any such cases."
But my parsing progress remains unchanged.
What type of GPU are you using?
Nvidia Tesla M40 24GB * 8
I have successfully solved this problem. You need to install NCCL first, and you need to adjust the Docker Compose configuration, including the NCCL settings and the container's shared memory size.
If you need to build the image from source for secondary development, you can add these settings in the Dockerfile.
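For reference, here is a minimal sketch of the kind of `docker-compose-gpu.yml` adjustment being described. The service name, the shared-memory size, and the particular NCCL variables are illustrative assumptions rather than the exact values used above (NCCL_DEBUG=INFO simply surfaces more detail, as the error message itself suggests):

```yaml
# docker-compose-gpu.yml (sketch; adjust the service name and sizes to your setup)
services:
  ragflow:
    shm_size: "4gb"            # the 64 MB default is too small for multi-GPU tensor broadcasts
    environment:
      - NCCL_DEBUG=INFO        # verbose NCCL logging, as the error message recommends
      - NCCL_P2P_DISABLE=1     # assumption: disable peer-to-peer copies if the GPUs lack P2P/NVLink
      - NCCL_IB_DISABLE=1      # assumption: disable the InfiniBand transport on hosts without IB
```

If you build your own image instead, the same variables can be baked in with ENV lines in the Dockerfile.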
If you start from the embedded container, DeepDoc parsing will run on the GPU and an OOM will occur. You need to adjust ocr.py so that the OCR step runs on the CPU instead.
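A minimal sketch of that kind of ocr.py change, assuming the OCR models are loaded through onnxruntime; the function name and the use_gpu flag here are hypothetical and not RAGFlow's actual API:

```python
# Hypothetical sketch, not RAGFlow's actual ocr.py: pin the OCR sessions to the CPU
# execution provider so DeepDoc stops competing with the embedding model for GPU memory.
import onnxruntime as ort

def load_ocr_session(model_path: str, use_gpu: bool = False) -> ort.InferenceSession:
    # Prefer CUDA only when explicitly requested; otherwise run OCR entirely on the CPU.
    providers = (["CUDAExecutionProvider", "CPUExecutionProvider"]
                 if use_gpu else ["CPUExecutionProvider"])
    return ort.InferenceSession(model_path, providers=providers)
```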
After these adjustments, I am now able to use multiple GPUs.
I tried to set these env vars but it still reports NCCL errors. -_- (NCCL version is 2.21.5.) What type of GPUs do you use?
Are you starting from the Docker image or compiling from source? The default shm_size is 64 MB and needs to be adjusted according to your GPU memory size. I have 4090 * 8 and set it to 4 GB, and this error did not occur during operation.
Really appreciate it. I set shm_size="12gb" and then it worked. Thanks for your help.
Has this been merged into the code yet? I also run into parsing errors when using the GPU yml file.
