I have several plain text docx files. After being cleaned by dify, only a small amount of text was read and a lot of text was lost.

Open • PsycheJ opened this issue 1 year ago • 1 comment

Self Checks

[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

I have several plain text docx files. After being cleaned by Dify, only a small amount of text was read and a lot of text was lost. I remember the files were fully readable in earlier versions of Dify.

Attachments: 1、行政处罚案件档案整理规则.docx (Rules for Organizing Administrative Penalty Case Files) and two QQ screenshots (QQ截图20240829111255, QQ截图20240829111305).

2. Additional context or comments

No response

3. Can you help us with this feature?

[ ] I am interested in contributing to this feature.

Aug 29 '24 03:08 PsycheJ

The text loss you see when cleaning plain text DOCX files with Dify may be caused by the preprocessing rules applied during the cleaning step. Two rules in particular, replacing continuous spaces, newlines, and tabs, and deleting all URLs and email addresses, can remove content if they are not configured properly.

Supporting evidence from the document:

Preprocessing rules:

Replace continuous spaces, newlines, and tabs;

Delete all URLs and email addresses;

Additionally, the CleanProcessor class in Dify's codebase shows that during the extraction and transformation phases, text documents are cleaned and split according to the processing rules. This includes removing extra spaces, URLs, and emails, which can inadvertently remove or alter significant portions of the text, leading to the observed loss of content ^[1]^[2]^[3].
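To make the two rules above concrete, here is a minimal sketch of what such a cleaning step could look like. It is not Dify's actual `CleanProcessor` code: the function name `clean_text`, its parameters, and the regular expressions are assumptions chosen for illustration only. Running something like it on the raw extracted text may help check how much content these rules alone would remove.

```python
import re


def clean_text(text: str,
               collapse_whitespace: bool = True,
               strip_urls_emails: bool = True) -> str:
    """Hypothetical stand-in for the two preprocessing rules (not Dify's code)."""
    if collapse_whitespace:
        # Collapse runs of three or more newlines into a single blank line.
        text = re.sub(r"\n{3,}", "\n\n", text)
        # Collapse runs of spaces, tabs, form feeds, and carriage returns.
        text = re.sub(r"[ \t\f\r]{2,}", " ", text)
    if strip_urls_emails:
        # Illustrative patterns; real-world URL and email matching is messier.
        text = re.sub(r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+\.[A-Za-z0-9.-]+", "", text)
        text = re.sub(r"https?://\S+", "", text)
    return text


def retained_ratio(original: str, cleaned: str) -> float:
    """Fraction of characters that survived cleaning (1.0 for empty input)."""
    return len(cleaned) / len(original) if original else 1.0


sample = "Rule 1:   keep files\n\n\n\nContact admin@example.com or https://example.com/docs"
cleaned = clean_text(sample)
print(cleaned)
print(f"retained: {retained_ratio(sample, cleaned):.0%}")
```

If the retained ratio for a document stays close to 100% under rules like these, the missing text was more likely lost before cleaning, for example during DOCX extraction, than during the cleaning itself.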

Aug 29 '24 03:08 dosubot[bot]