ragflow
[Bug]: v0.15.0-17 / Chunk Book + RAPTOR / Page(265~277): [ERROR]Internal server error while chunking
Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
RAGFlow workspace code commit ID
8939206531d8b994fd10565d4056b24ea599c1c5
RAGFlow image version
v0.15.0-17-g35580af8 slim
Other environment information
WSL Linux Ubuntu 5.15.167.4-microsoft-standard-WSL2
Actual behavior
I get this error when parsing a PDF with the Book chunking method and RAPTOR enabled. I use a local Ollama instance with the snowflake-arctic-embed2 embedding model.
Expected behavior
Logs UI
Page(265~277): [ERROR]Internal server error while chunking: Fuckedup! T:47.28228251139323,B:762.828857421875,X0:57.11962381998698,X1:534.1355387369791 ==> {x0: 79.66666666666667, top: 756.5, x1: 68.16666666666667, bottom: 765.25}
[ERROR]handle_task got exception, please check log
logs console
2024-12-22 09:31:56,363 INFO 43360 set_progress(29072140c03f11efb6e60242ac120006), progress: -1, progress_msg: **Page(265~277): [ERROR]Internal server error while chunking:
Fuckedup! T:47.28228251139323,B:762.828857421875,X0:57.11962381998698,X1:534.1355387369791 ==> {x0: 79.66666666666667, top: 756.5, x1: 68.16666666666667, bottom: 765.25}
2024-12-22 09:31:56,386 ERROR 43360 Chunking Rapport 2009 vol IIb.pdf got exception**
Traceback (most recent call last):
File "/ragflow/rag/svr/task_executor.py", line 209, in build_chunks
cks = chunker.chunk(task["name"], binary=binary, from_page=task["from_page"],
File "/ragflow/rag/app/book.py", line 90, in chunk
sections, tbls = pdf_parser(filename if not binary else binary,
File "/ragflow/rag/app/book.py", line 51, in __call__
tbls = self._extract_table_figure(True, zoomin, True, True)
File "/ragflow/deepdoc/parser/pdf_parser.py", line 797, in _extract_table_figure
(cropout(
File "/ragflow/deepdoc/parser/pdf_parser.py", line 778, in cropout
imgs = [cropout(arr, ltype, poss) for p, arr in pn]
File "/ragflow/deepdoc/parser/pdf_parser.py", line 778, in <listcomp>
imgs = [cropout(arr, ltype, poss) for p, arr in pn]
File "/ragflow/deepdoc/parser/pdf_parser.py", line 755, in cropout
ii = Recognizer.find_overlapped(b, louts, naive=True)
File "/ragflow/deepdoc/vision/recognizer.py", line 270, in find_overlapped
ov = Recognizer.overlapped_area(bxs[i], box)
File "/ragflow/deepdoc/vision/recognizer.py", line 147, in overlapped_area
assert x0_ <= x1_, "Fuckedup! T:{},B:{},X0:{},X1:{} ==> {}".format(
AssertionError: Fuckedup! T:47.28228251139323,B:762.828857421875,X0:57.11962381998698,X1:534.1355387369791 ==> {'x0': 79.66666666666667, 'top': 756.5, 'x1': 68.16666666666667
'bottom': 765.25}
2024-12-22 09:31:56,389 INFO 43360 set_progress(29072140c03f11efb6e60242ac120006), progress: -1, progress_msg: [ERROR]handle_task got exception, please check log
2024-12-22 09:31:56,407 ERROR 43360 handle_task got exception for task {"id": "29072140c03f11efb6e60242ac120006", "doc_id": "ee015cd0bfa011ef8da10242ac120006", "from_page"
264, "to_page": 276, "retry_count": 0, "kb_id": "d5f8d862bf9f11efb7f40242ac120006", "parser_id": "book", "parser_config": {"auto_keywords": 0, "auto_questions": 0, "raptor":
use_raptor": true, "prompt": "Please summarize the following paragraphs. Be careful with the numbers, do not make things up. Paragraphs as following:\n {cluster_content}
The above is the content you need to summarize.", "max_token": 256, "threshold": 0.1, "max_cluster": 64, "random_seed": 0}}, "name": "Rapport 2009 vol IIb.pdf", "type": "pdf", "location": "Rapport 2009 vol IIb.pdf", "size": 32880376, "t
ant_id": "ee69c9acbde111efbe830242ac120006", "language": "English", "embd_id": "snowflake-arctic-embed2:latest@Ollama", "pagerank": 0, "img2txt_id": "llama3.2-vision:latest@O
ama", "asr_id": "", "llm_id": "qwen2.5:14b@Ollama", "update_time": 1734856297612}
Traceback (most recent call last):
File "/ragflow/rag/svr/task_executor.py", line 511, in handle_task
do_handle_task(task)
File "/ragflow/rag/svr/task_executor.py", line 449, in do_handle_task
chunks = build_chunks(task, progress_callback)
File "/ragflow/rag/svr/task_executor.py", line 209, in build_chunks
cks = chunker.chunk(task["name"], binary=binary, from_page=task["from_page"],
File "/ragflow/rag/app/book.py", line 90, in chunk
sections, tbls = pdf_parser(filename if not binary else binary,
File "/ragflow/rag/app/book.py", line 51, in __call__
tbls = self._extract_table_figure(True, zoomin, True, True)
File "/ragflow/deepdoc/parser/pdf_parser.py", line 797, in _extract_table_figure
(cropout(
File "/ragflow/deepdoc/parser/pdf_parser.py", line 778, in cropout
imgs = [cropout(arr, ltype, poss) for p, arr in pn]
File "/ragflow/deepdoc/parser/pdf_parser.py", line 778, in <listcomp>
imgs = [cropout(arr, ltype, poss) for p, arr in pn]
File "/ragflow/deepdoc/parser/pdf_parser.py", line 755, in cropout
ii = Recognizer.find_overlapped(b, louts, naive=True)
File "/ragflow/deepdoc/vision/recognizer.py", line 270, in find_overlapped
ov = Recognizer.overlapped_area(bxs[i], box)
File "/ragflow/deepdoc/vision/recognizer.py", line 147, in overlapped_area
assert x0_ <= x1_, "Fuckedup! T:{},B:{},X0:{},X1:{} ==> {}".format(
**AssertionError: Fuckedup! T:47.28228251139323,B:762.828857421875,X0:57.11962381998698,X1:534.1355387369791 ==> {'x0': 79.66666666666667, 'top': 756.5, 'x1': 68.16666666666667
'bottom': 765.25}**
2024-12-22 09:31:56,856 INFO 43360 task_consumer_0 reported heartbeat: {"name": "task_consumer_0", "now": "2024-12-22T09:31:56.855135", "boot_at": "2024-12-21T23:47:57.83
17", "pending": 0, "lag": 0, "done": 23, "failed": 2, "current": null}
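For anyone reading the traceback: the assertion in `overlapped_area` fires because the second box in the message has `x0` (79.67) greater than `x1` (68.17), so it is horizontally inverted. The sketch below is not RAGFlow's code. `normalize_box` and `overlap_area` are hypothetical helpers, and the box layout (`x0`, `x1`, `top`, `bottom` keys) is inferred from the error message. It only illustrates one possible way to tolerate such a box by reordering its coordinates before computing the intersection. Whether swapping is the right fix, rather than dropping the box or fixing whatever produces it upstream, is an open question.

```python
def normalize_box(box):
    """Return a copy of box with x0 <= x1 and top <= bottom (hypothetical helper)."""
    x0, x1 = sorted((box["x0"], box["x1"]))
    top, bottom = sorted((box["top"], box["bottom"]))
    return {**box, "x0": x0, "x1": x1, "top": top, "bottom": bottom}


def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes, 0.0 if they do not overlap."""
    a, b = normalize_box(a), normalize_box(b)
    width = min(a["x1"], b["x1"]) - max(a["x0"], b["x0"])
    height = min(a["bottom"], b["bottom"]) - max(a["top"], b["top"])
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


# Coordinates copied from the AssertionError message in the log above.
layout = {"x0": 57.11962381998698, "x1": 534.1355387369791,
          "top": 47.28228251139323, "bottom": 762.828857421875}
bad = {"x0": 79.66666666666667, "top": 756.5,
       "x1": 68.16666666666667, "bottom": 765.25}

fixed = normalize_box(bad)          # x0 and x1 are swapped back into order
area = overlap_area(layout, bad)    # no assertion, a small positive overlap
print(fixed, area)
```

With the coordinates reordered, the inverted box overlaps the larger one by a thin strip near the bottom of the page instead of raising.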
Steps to reproduce
Same as above: parse the PDF with the Book chunking method and RAPTOR enabled.
Additional information
No response
Could you attach the PDF file? It would help with debugging.
Hi Kevin, you can download it here: https://filesender.renater.fr/?s=download&token=207ddd57-7bf6-420b-9d63-0188017aead9 Kind regards, David.
Have you fixed this bug? I am hitting it too.