
[Bug]: v0.15.0-17 / Chunk Book + RAPTOR / Page(265~277): [ERROR]Internal server error while chunking

Open dromeuf opened this issue 1 year ago • 3 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

RAGFlow workspace code commit ID

8939206531d8b994fd10565d4056b24ea599c1c5

RAGFlow image version

v0.15.0-17-g35580af8 slim

Other environment information

WSL Linux Ubuntu 5.15.167.4-microsoft-standard-WSL2

Actual behavior

I get this error on a PDF when using the Book chunking method with RAPTOR enabled. I use a local Ollama instance with the snowflake-arctic-embed2 embedding model.

Expected behavior

Logs UI

Page(265~277): [ERROR]Internal server error while chunking: Fuckedup! T:47.28228251139323,B:762.828857421875,X0:57.11962381998698,X1:534.1355387369791 ==> {x0: 79.66666666666667, top: 756.5, x1: 68.16666666666667, bottom: 765.25}
[ERROR]handle_task got exception, please check log

logs console

2024-12-22 09:31:56,363 INFO     43360 set_progress(29072140c03f11efb6e60242ac120006), progress: -1, progress_msg: **Page(265~277): [ERROR]Internal server error while chunking:
Fuckedup! T:47.28228251139323,B:762.828857421875,X0:57.11962381998698,X1:534.1355387369791 ==> {x0: 79.66666666666667, top: 756.5, x1: 68.16666666666667, bottom: 765.25}
2024-12-22 09:31:56,386 ERROR    43360 Chunking Rapport 2009 vol IIb.pdf got exception**                                                                                                                                       
Traceback (most recent call last):                                                                                                                                            
  File "/ragflow/rag/svr/task_executor.py", line 209, in build_chunks                                                                                                         
    cks = chunker.chunk(task["name"], binary=binary, from_page=task["from_page"],                                                                                             
  File "/ragflow/rag/app/book.py", line 90, in chunk                                                                                                                          
    sections, tbls = pdf_parser(filename if not binary else binary,                                                                                                           
  File "/ragflow/rag/app/book.py", line 51, in __call__                                                                                                                       
    tbls = self._extract_table_figure(True, zoomin, True, True)                                                                                                               
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 797, in _extract_table_figure                                                                                            
    (cropout(                                                                                                                                                                 
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 778, in cropout                                                                                                          
    imgs = [cropout(arr, ltype, poss) for p, arr in pn]                                                                                                                       
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 778, in <listcomp>                                                                                                       
    imgs = [cropout(arr, ltype, poss) for p, arr in pn]                                                                                                                       
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 755, in cropout                                                                                                          
    ii = Recognizer.find_overlapped(b, louts, naive=True)                                                                                                                     
  File "/ragflow/deepdoc/vision/recognizer.py", line 270, in find_overlapped                                                                                                  
    ov = Recognizer.overlapped_area(bxs[i], box)                                                                                                                              
  File "/ragflow/deepdoc/vision/recognizer.py", line 147, in overlapped_area                                                                                                  
    assert x0_ <= x1_, "Fuckedup! T:{},B:{},X0:{},X1:{} ==> {}".format(                                                                                                       
AssertionError: Fuckedup! T:47.28228251139323,B:762.828857421875,X0:57.11962381998698,X1:534.1355387369791 ==> {'x0': 79.66666666666667, 'top': 756.5, 'x1': 68.16666666666667
'bottom': 765.25}                                                                                                                                                             
2024-12-22 09:31:56,389 INFO     43360 set_progress(29072140c03f11efb6e60242ac120006), progress: -1, progress_msg: [ERROR]handle_task got exception, please check log         
2024-12-22 09:31:56,407 ERROR    43360 handle_task got exception for task {"id": "29072140c03f11efb6e60242ac120006", "doc_id": "ee015cd0bfa011ef8da10242ac120006", "from_page"
264, "to_page": 276, "retry_count": 0, "kb_id": "d5f8d862bf9f11efb7f40242ac120006", "parser_id": "book", "parser_config": {"auto_keywords": 0, "auto_questions": 0, "raptor": 
use_raptor": true, "prompt": "Please summarize the following paragraphs. Be careful with the numbers, do not make things up. Paragraphs as following:\n      {cluster_content}
The above is the content you need to summarize.", "max_token": 256, "threshold": 0.1, "max_cluster": 64, "random_seed": 0}}, "name": "Rapport 2009 vol IIb.pdf", "type": "pdf", "location": "Rapport 2009 vol IIb.pdf", "size": 32880376, "t
ant_id": "ee69c9acbde111efbe830242ac120006", "language": "English", "embd_id": "snowflake-arctic-embed2:latest@Ollama", "pagerank": 0, "img2txt_id": "llama3.2-vision:latest@O
ama", "asr_id": "", "llm_id": "qwen2.5:14b@Ollama", "update_time": 1734856297612}                                                                                             
Traceback (most recent call last):                                                                                                                                            
  File "/ragflow/rag/svr/task_executor.py", line 511, in handle_task                                                                                                          
    do_handle_task(task)                                                                                                                                                      
  File "/ragflow/rag/svr/task_executor.py", line 449, in do_handle_task                                                                                                       
    chunks = build_chunks(task, progress_callback)                                                                                                                            
  File "/ragflow/rag/svr/task_executor.py", line 209, in build_chunks                                                                                                         
    cks = chunker.chunk(task["name"], binary=binary, from_page=task["from_page"],                                                                                             
  File "/ragflow/rag/app/book.py", line 90, in chunk                                                                                                                          
    sections, tbls = pdf_parser(filename if not binary else binary,                                                                                                           
  File "/ragflow/rag/app/book.py", line 51, in __call__                                                                                                                       
    tbls = self._extract_table_figure(True, zoomin, True, True)                                                                                                               
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 797, in _extract_table_figure                                                                                            
    (cropout(                                                                                                                                                                 
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 778, in cropout                                                                                                          
    imgs = [cropout(arr, ltype, poss) for p, arr in pn]                                                                                                                       
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 778, in <listcomp>                                                                                                       
    imgs = [cropout(arr, ltype, poss) for p, arr in pn]                                                                                                                       
  File "/ragflow/deepdoc/parser/pdf_parser.py", line 755, in cropout                                                                                                          
    ii = Recognizer.find_overlapped(b, louts, naive=True)                                                                                                                     
  File "/ragflow/deepdoc/vision/recognizer.py", line 270, in find_overlapped                                                                                                  
    ov = Recognizer.overlapped_area(bxs[i], box)                                                                                                                              
  File "/ragflow/deepdoc/vision/recognizer.py", line 147, in overlapped_area                                                                                                  
    assert x0_ <= x1_, "Fuckedup! T:{},B:{},X0:{},X1:{} ==> {}".format(                                                                                                       
**AssertionError: Fuckedup! T:47.28228251139323,B:762.828857421875,X0:57.11962381998698,X1:534.1355387369791 ==> {'x0': 79.66666666666667, 'top': 756.5, 'x1': 68.16666666666667
'bottom': 765.25}**                                                                                                                                                             
2024-12-22 09:31:56,856 INFO     43360 task_consumer_0 reported heartbeat: {"name": "task_consumer_0", "now": "2024-12-22T09:31:56.855135", "boot_at": "2024-12-21T23:47:57.83
17", "pending": 0, "lag": 0, "done": 23, "failed": 2, "current": null}                                                                                                        
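
The assertion message shows a box whose x1 (68.17) is smaller than its x0 (79.67), so its left and right edges are swapped. The sketch below replays only the numbers from the log. It assumes the failing check intersects the two boxes on the x-axis, which the `assert x0_ <= x1_` line in the traceback suggests but does not show in full. `normalize_box` and `x_intersection` are hypothetical helpers written for illustration, not RAGFlow functions, and sorting the edges is one possible workaround rather than a confirmed fix.

```python
def normalize_box(box):
    """Return a copy of box with x0 <= x1 and top <= bottom (hypothetical helper)."""
    x0, x1 = sorted((box["x0"], box["x1"]))
    top, bottom = sorted((box["top"], box["bottom"]))
    return {**box, "x0": x0, "x1": x1, "top": top, "bottom": bottom}


def x_intersection(a, b):
    """Intersect two boxes on the x-axis and return (left, right)."""
    return max(a["x0"], b["x0"]), min(a["x1"], b["x1"])


# Values copied from the assertion message in the log above.
region = {"x0": 57.11962381998698, "x1": 534.1355387369791,
          "top": 47.28228251139323, "bottom": 762.828857421875}
box = {"x0": 79.66666666666667, "top": 756.5,
       "x1": 68.16666666666667, "bottom": 765.25}

# With the box as logged, the intersection has left > right,
# which is the condition `assert x0_ <= x1_` rejects.
left, right = x_intersection(region, box)
inverted = left > right

# After sorting the edges, the intersection is well-formed.
fixed = normalize_box(box)
fixed_left, fixed_right = x_intersection(region, fixed)
overlap_width = fixed_right - fixed_left

print("inverted before normalization:", inverted)
print("overlap width after normalization:", overlap_width)
```

If this reading is right, the open question is where the swapped box comes from for pages 265~277 of this PDF; normalizing the coordinates would only hide the symptom.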

Steps to reproduce

Same as described under "Actual behavior".

Additional information

No response

dromeuf · Dec 22 '24 09:12

Could you attach the PDF file? It would help with debugging.

KevinHuSh · Dec 23 '24 02:12

Hi Kevin, you can download it here: https://filesender.renater.fr/?s=download&token=207ddd57-7bf6-420b-9d63-0188017aead9 Kind regards, David.

dromeuf · Dec 23 '24 08:12

Has this bug been fixed? I am hitting it as well.

AngGaGim · Mar 17 '25 06:03