
[Bug]: Error when parsing .DOCX files when chunking method is set to Laws

Open rplescia opened this issue 1 year ago • 3 comments

Is there an existing issue for the same bug?

  • [X] I have checked the existing issues.

Branch name

main

Commit ID

na

Other environment information

No response

Actual behavior

When the chunking method is set to "Laws", I cannot parse an MS Word (DOCX) document. I have tried different embedding models and chunking parameters, but parsing still fails. When the same document is converted to PDF, it parses fine. [Screenshot attached: Capture]

Expected behavior

No response

Steps to reproduce

1. Set up a new knowledge base with the default chunking method set to Laws.
2. Upload a DOCX file and start parsing.

Additional information

No response

rplescia avatar Oct 29 '24 12:10 rplescia

Could you attach the file so I can debug it?

KevinHuSh avatar Oct 30 '24 01:10 KevinHuSh

Unfortunately, I cannot send the exact document because it is confidential. I will see if I can find a sample document that exhibits the same behaviour. The type of document I'm using is a facility agreement, like this https://assets.publishing.service.gov.uk/media/5a7f05b0e5274a2e8ab49acc/facility-agreement.pdf or this https://www.sec.gov/Archives/edgar/data/1415016/000119312514260282/d699526dex99b25.htm

rplescia avatar Oct 30 '24 13:10 rplescia

@KevinHuSh If it is any help, the same error occurs when chunking the documents using the 'One' and 'Manual' methods. I still haven't figured out what in the document is causing the issue, but I have managed to find the piece of code that produces the error. My guess is that one of the parameters is being passed a string where an int value is expected.

    try:
        cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
                            to_page=row["to_page"], lang=row["language"], callback=callback,
                            kb_id=row["kb_id"], parser_config=row["parser_config"],
                            tenant_id=row["tenant_id"])
        cron_logger.info(
            "Chunking({}) {}/{}".format(timer() - st, row["location"], row["name"]))
    except Exception as e:
        callback(-1, "Internal server error while chunking: %s" % str(e).replace("'", ""))
        cron_logger.error(
            "Chunking {}/{}: {}".format(row["location"], row["name"], str(e)))
        traceback.print_exc()
        return
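One way the string-versus-int guess could be checked is sketched below. This is not RAGFlow code: `non_int_fields`, `run_with_traceback`, `fake_chunk`, and the sample `row` values are all hypothetical, and the thread does not confirm that a type mismatch is the actual cause. The sketch only shows how to list page fields that are not ints and how to keep the full traceback, since the `except` block above reports only `str(e)` to the callback.

```python
import traceback

# Hypothetical task row: the keys mirror the snippet above, the values are invented.
row = {"name": "agreement.docx", "from_page": "0", "to_page": 12}


def non_int_fields(row, keys=("from_page", "to_page")):
    """Return {key: type name} for each listed key whose value is not an int."""
    return {
        k: type(row.get(k)).__name__
        for k in keys
        if not isinstance(row.get(k), int)
    }


def run_with_traceback(fn, *args, **kwargs):
    """Call fn; return (result, None) on success or (None, traceback text) on failure."""
    try:
        return fn(*args, **kwargs), None
    except Exception:
        # format_exc() keeps the file and line of the failure, unlike str(e).
        return None, traceback.format_exc()


def fake_chunk(name, from_page, to_page):
    """Stand-in for chunker.chunk (assumption): fails when a page bound is a str."""
    return list(range(from_page, to_page))


# Report suspicious fields before chunking, then capture the full traceback.
print(non_int_fields(row))
result, tb = run_with_traceback(
    fake_chunk, row["name"], row["from_page"], row["to_page"]
)
if tb is not None:
    print(tb)
```

If the field check comes back empty for a failing document, the type guess is probably wrong, and the captured traceback is the more useful thing to paste into the issue.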

rplescia avatar Nov 06 '24 16:11 rplescia

@rplescia I cannot reproduce this with RAGFlow v0.14.1 and the facility-agreement.pdf you linked. Please try RAGFlow v0.14.1, which has much better logging. If you are still able to reproduce the error, please paste the error log here.

yuzhichang avatar Dec 02 '24 10:12 yuzhichang