[Bug]: Corrupt text in chunks

Open stevenguan08 opened this issue 8 months ago • 1 comment

Self Checks

[x] I have searched for existing issues, including closed ones.
[x] I confirm that I am using English to submit this report (Language Policy).
[x] Non-English title submissions will be closed directly (Language Policy).
[x] Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

2d752383b18a8ec254b070f3d551bd484c065e8b

RAGFlow image version

17.2

Other environment information

The PDF file appears clean and clear when viewed directly. However, when its content is extracted into chunks, the chunks contain gibberish or corrupted text. The problem shows up in almost every chunk, which points to a systemic fault in the extraction step rather than an isolated glitch.
Details

    Source of the Problem:
        The original PDF file is visually clean and readable.
        The issue arises during the conversion or extraction of the PDF content into text chunks.

    Nature of the Problem:
        The extracted chunks contain nonsensical or corrupted text, which does not match the content visible in the PDF.
        This corruption affects nearly all chunks, suggesting that the issue is not isolated but rather pervasive.

    Impact:
        The corrupted chunks make it impossible to accurately analyze or use the extracted text for further processing.
        This impacts downstream tasks such as keyword extraction, question answering, or any other application relying on the integrity of the text data.

    Relevant Context:
        The PDF contains structured content, including sections, headings, and paragraphs.
        The gibberish appears to replace or distort the original text, making it unreadable and unusable.

Actual behavior

The extracted chunks contain nonsensical or corrupted text that does not match the clean and readable content visible in the PDF. Words and phrases are often distorted, with characters replaced by symbols, random letters, or sequences that do not form coherent words. Some chunks may also include overlapping or repeated text, further complicating readability. For example:

Expected (Clean Text) :

Actual (Corrupted Text) :

第七章美治时期菲华法律和经济地位284 第二节美国排华法案引入菲律宾群岛…暥暦290暋棽椆棸一暍暥排华法案暦引入的背景暍菲律宾社会的排华情绪和原因…………二暍民治政府的成立及暥排华法案暦的通过和实施…棽椆棻…三暍法案实施的成效及后来对菲律宾移民法律的影响……棽椆椀……四暍中国领事对排华的交涉……棽椆椄…第三节法案的成效及各方反应……………棾棸棸三暍华人对法案的规避…棾棸椃四暍法案实施期间移民的规模棾棻棻第四节排华法案对菲华社会经济的影响……棾棻棿暋暥暦一暍法案对菲华社会结构组成带来的影响棾棻椆二暍法案对菲律宾经济结构带来的影响………棾棻椆三暍对法案的抗争使侨社团结暍菲华社会政治主动性提高……棾棽棽棾棽椄

The gibberish appears randomly throughout the chunks, making it difficult to identify patterns or pinpoint specific causes. This behavior is consistent across nearly all chunks, regardless of the content type (headings, paragraphs, lists, etc.).

Expected behavior

The extracted chunks should accurately reflect the clean and clear text visible in the PDF.
There should be no gibberish or corrupted text in the chunks.

Steps to reproduce

    Open the PDF file and verify its clarity.
    Extract the content into chunks using the current method.
    Inspect the chunks and observe the presence of gibberish words.
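
The inspection in the last step can be made less manual with a small check. This is only a minimal sketch, assuming the chunks are available as plain Python strings. The characters in `SUSPECT_CHARS` are copied from the corrupted sample in this report, and treating them as extraction artifacts is an assumption, since some of them are valid (if rare) CJK characters. `SUSPECT_CHARS`, `suspect_counts`, and `flag_chunks` are hypothetical names and not part of RAGFlow.

```python
from collections import Counter

# Characters copied from the corrupted sample in this report.
# Assumption: in these chunks they are extraction artifacts, not intended text.
SUSPECT_CHARS = frozenset("暥暦暍暋棸棻棽棾棿椀椃椄椆")


def suspect_counts(chunk: str) -> Counter:
    """Count how often each suspect character occurs in one chunk."""
    return Counter(ch for ch in chunk if ch in SUSPECT_CHARS)


def flag_chunks(chunks: list[str], min_hits: int = 1) -> list[int]:
    """Return indexes of chunks with at least `min_hits` suspect characters."""
    return [
        i
        for i, chunk in enumerate(chunks)
        if sum(suspect_counts(chunk).values()) >= min_hits
    ]


if __name__ == "__main__":
    # Two fragments taken from the sample above: one clean, one corrupted.
    sample = [
        "第七章美治时期菲华法律和经济地位284",
        "四暍中国领事对排华的交涉……棽椆椄",
    ]
    print(flag_chunks(sample))
```

Running `suspect_counts` over all chunks and summing the results gives a per-character tally, which may help show whether the same few characters recur in the same positions (for example where punctuation or page numbers would be expected).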

Additional information

Apr 03 '25 05:04 stevenguan08