
[Bug]: If some chunks fail due to GPU OOM, the executor won't retry the failed chunk and the entire task fails.

Open · ChanningZhang opened this issue 10 months ago · 2 comments

Is there an existing issue for the same bug?

  • [x] I have checked the existing issues.

RAGFlow workspace code commit ID

448fa1c

RAGFlow image version

v0.16.0

Other environment information

GPU: A10 (24G)
OS: Ubuntu 22.04
nvidia: NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6

Actual behavior

Log from the rag-executor service (patx_stack_rag-executor.1):

2025-02-17 20:46:19.147064849 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running Conv node. Name:'/model.19/cv2/conv/Conv' Status Message: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=8a8fe068e7eb ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);

2025-02-17 20:46:19,153 INFO 10 set_progress(a9b49952ed2c11efbe6b02420a000082), progress: -1, progress_msg: 20:46:19 Page(121~133): [ERROR]Internal server error while chunking: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'/model.19/cv2/conv/Conv' Status Message: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=8a8fe068e7eb ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);

2025-02-17 20:46:19,164 ERROR 10 Chunking 专利审查指南2023(官网发布版).pdf/专利审查指南2023(官网发布版).pdf got exception
Traceback (most recent call last):
  File "/ragflow/rag/svr/task_executor.py", line 218, in build_chunks
    cks = chunker.chunk(task["name"], binary=binary, from_page=task["from_page"],
  File "/ragflow/rag/app/laws.py", line 168, in chunk
    for txt, poss in pdf_parser(filename if not binary else binary,
  File "/ragflow/rag/app/laws.py", line 131, in __call__
    self._layouts_rec(zoomin)
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 327, in _layouts_rec
    self.boxes, self.page_layout = self.layouter(
  File "/ragflow/deepdoc/vision/layout_recognizer.py", line 70, in __call__
    layouts = super().__call__(image_list, thr, batch_size)
  File "/ragflow/deepdoc/vision/recognizer.py", line 483, in __call__
    bb = self.postprocess(self.ort_sess.run(None, {k: v for k, v in ins.items() if k in self.input_names}, self.run_options)[0], ins, thr)
  File "/ragflow/.venv/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'/model.19/cv2/conv/Conv' Status Message: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:123 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 std::conditional_t<THRW, void, onnxruntime::common::Status> onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*, const char*, int) [with ERRTYPE = cudaError; bool THRW = true; std::conditional_t<THRW, void, common::Status> = void] CUDA failure 2: out of memory ; GPU=0 ; hostname=8a8fe068e7eb ; file=/onnxruntime_src/onnxruntime/core/providers/cuda/cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);

Expected behavior

No response

Steps to reproduce

I set WS=2 so that two tasks run in parallel. At peak, the two tasks together use more than the 24 GB of GPU memory, even though each individual task usually uses less than 5 GB. If RAGFlow implemented an OOM (out-of-memory) retry mechanism, the tasks should therefore complete successfully.
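For illustration, here is a minimal sketch of the kind of chunk-level OOM retry being asked for. It is not RAGFlow code: `chunk_with_retry` and `is_gpu_oom` are hypothetical helpers, and the usage comment only mirrors the call site visible in the traceback above.

```python
import logging
import time


def is_gpu_oom(exc: Exception) -> bool:
    # onnxruntime surfaces cudaMalloc failures as a generic RuntimeException
    # whose message contains "out of memory", so string matching is the
    # practical check here.
    return "out of memory" in str(exc).lower()


def chunk_with_retry(chunk_fn, *args, max_retries=3, base_delay=30.0, **kwargs):
    # Call chunk_fn(*args, **kwargs), retrying only on GPU OOM. The delay grows
    # exponentially so the other worker (the second task when WS=2) has time to
    # release its GPU memory before the next attempt; any non-OOM exception is
    # re-raised immediately so real failures still surface.
    for attempt in range(max_retries + 1):
        try:
            return chunk_fn(*args, **kwargs)
        except Exception as exc:
            if not is_gpu_oom(exc) or attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)
            logging.warning("GPU OOM on attempt %d/%d, retrying in %.0fs: %s",
                            attempt + 1, max_retries, delay, exc)
            time.sleep(delay)


# Hypothetical usage at the call site seen in the traceback
# (rag/svr/task_executor.py, build_chunks):
#
# cks = chunk_with_retry(chunker.chunk, task["name"], binary=binary,
#                        from_page=task["from_page"], ...)
```

Detecting OOM by inspecting the exception message is a pragmatic assumption here, since onnxruntime does not raise a dedicated out-of-memory exception type from Python.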

Additional information

No response

ChanningZhang · Feb 17 '25 13:02

onnxruntime does not support GPU well.

KevinHuSh · Feb 18 '25 02:02

onnxruntime does not support GPU well.

Can RAGFlow add some kind of retry mechanism at the chunk level?

ChanningZhang · Feb 18 '25 04:02

For the GPU memory issue caused by ONNX, can you try again? The OOM might be caused by memory fragmentation; we've just found a potential solution.

yingfeng · Sep 12 '25 08:09
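Related to the fragmentation point above, one possible mitigation (an assumption, not necessarily the fix the maintainers refer to) is to constrain onnxruntime's CUDA memory arena through the standard CUDAExecutionProvider options when the inference session is created. The model path below is hypothetical; the option names are part of onnxruntime's documented provider options.

```python
import onnxruntime as ort

# Hypothetical model path; substitute the ONNX model that deepdoc's
# recognizer.py actually loads.
MODEL_PATH = "layout_model.onnx"

# Standard CUDAExecutionProvider options:
#  - gpu_mem_limit caps this session's CUDA memory arena (in bytes), so two
#    parallel executors (WS=2) cannot each grow toward the full 24 GB.
#  - arena_extend_strategy="kSameAsRequested" extends the arena only by the
#    requested size instead of the next power of two, which reduces
#    fragmentation-driven over-allocation.
cuda_options = {
    "device_id": 0,
    "gpu_mem_limit": 8 * 1024 * 1024 * 1024,  # 8 GB per session
    "arena_extend_strategy": "kSameAsRequested",
}

session = ort.InferenceSession(
    MODEL_PATH,
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
```

Capping gpu_mem_limit per session keeps parallel workers from competing for the whole card, and a kSameAsRequested arena tends to fragment less than the default power-of-two growth, at some cost in allocation speed.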