dify: Old-format .doc files cannot be encoded and uploaded to the knowledge base

Self Checks

[x] This is only for bug report, if you would like to ask a question, please head to Discussions.
[x] I have searched for existing issues, including closed ones.
[x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
[x] Please do not modify this template :) and fill in all the required fields.

Dify version

1.0

Cloud or Self Hosted

Self Hosted (Docker), Cloud

Steps to reproduce

Step 1: Open the knowledge base.
Step 2: Click on Preview Chunk.

I tested the file on both the official website and a local open-source environment. It always happens.

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

Error message:

2025-04-01 06:13:13.270 INFO [MainThread] [strategy.py:161] - Task tasks.duplicate_document_indexing_task.duplicate_document_indexing_task[126653de-ef83-4062-bcf0-c1dc7576fd4c] received
2025-04-01 06:13:13.282 INFO [Dummy-12] [duplicate_document_indexing_task.py:60] - Start process document: bdb9d202-a4c6-4f48-9088-aa55526c998d
2025-04-01 06:13:16.550 ERROR [Dummy-12] [indexing_runner.py:96] - consume document failed
Traceback (most recent call last):
  File "/app/api/core/rag/extractor/text_extractor.py", line 29, in extract
    text = Path(self._file_path).read_text(encoding=self._encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/pathlib.py", line 1028, in read_text
    return f.read()
           ^^^^^^^^
  File "", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/api/core/indexing_runner.py", line 70, in run
    text_docs = self._extract(index_processor, dataset_document, processing_rule.to_dict())
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/api/core/indexing_runner.py", line 351, in _extract
    text_docs = index_processor.extract(extract_setting, process_rule_mode=process_rule["mode"])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/api/core/rag/index_processor/processor/paragraph_index_processor.py", line 22, in extract
    text_docs = ExtractProcessor.extract(
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/api/core/rag/extractor/extract_processor.py", line 161, in extract
    return extractor.extract()
           ^^^^^^^^^^^^^^^^^^^
  File "/app/api/core/rag/extractor/text_extractor.py", line 32, in extract
    detected_encodings = detect_file_encodings(self._file_path)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/api/core/rag/extractor/helpers.py", line 43, in detect_file_encodings
    raise RuntimeError(f"Could not detect encoding for {file_path}")
RuntimeError: Could not detect encoding for /tmp/tmp8z_ndsit/_h5h_hyx.doc
2025-04-01 06:13:16.557 INFO [Dummy-12] [duplicate_document_indexing_task.py:92] - Processed dataset: dccabf81-8aeb-471d-a11c-58a03cc95c2b latency: 3.285501961945556
2025-04-01 06:13:16.576 INFO [Dummy-12] [trace.py:128] - Task tasks.duplicate_document_indexing_task.duplicate_document_indexing_task[126653de-ef83-4062-bcf0-c1dc7576fd4c] succeeded in 3.3045183670474216s: None

Mar 31 '25 06:03 Silow9

Hey @Silow9! 👋 I'm here to help you with bugs, questions, and becoming a contributor. Let's squash those pesky bugs together!

The issue you're encountering with older .doc files in Dify might be related to the requirement of using the Unstructured API for processing such files. Dify typically handles DOCX files directly, but older .doc files may need additional configuration, such as setting UNSTRUCTURED_API_URL and UNSTRUCTURED_API_KEY. You might experience issues if these settings are not configured correctly, or if ETL_TYPE is set to 'dify' (in that case these settings should not be necessary) ^[1].
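As a rough illustration of the configuration check described above, the sketch below reports which Unstructured settings are missing from an environment mapping. The setting names come from the comment; the function name, the default of 'dify' for an unset ETL_TYPE, and the rule that any other ETL_TYPE value needs both settings are assumptions made for this example, not confirmed Dify behavior.

```python
from collections.abc import Mapping

# Setting names as mentioned in the comment above.
UNSTRUCTURED_SETTINGS = ("UNSTRUCTURED_API_URL", "UNSTRUCTURED_API_KEY")


def missing_unstructured_settings(env: Mapping[str, str]) -> list[str]:
    """Return the Unstructured settings that are unset or blank.

    Assumptions (hypothetical, for illustration only): an unset ETL_TYPE
    is treated as "dify", and any other value needs both settings.
    """
    etl_type = env.get("ETL_TYPE", "dify").strip().lower()
    if etl_type == "dify":
        # With ETL_TYPE set to 'dify' the settings should not be necessary.
        return []
    return [name for name in UNSTRUCTURED_SETTINGS if not env.get(name, "").strip()]
```

Calling `missing_unstructured_settings(os.environ)` inside the API container would list any setting that still needs a value.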

Additionally, ensure that you are using the Doc Extractor feature for document processing and extraction, as this can resolve issues where the system indicates that no document was uploaded ^[2]. If the problem persists, it might be a bug or misconfiguration in your setup.


Mar 31 '25 06:03 dosubot[bot]

https://github.com/langgenius/dify/blob/0277a37fcad5ad86aeb239485c27fffd5cd90043/CONTRIBUTING.md#L23-L33

Please provide detailed steps to reproduce this error. Thanks.

Mar 31 '25 06:03 crazywoola

dify/CONTRIBUTING.md

Lines 23 to 33 in 0277a37

Bug reports

[!IMPORTANT] Please make sure to include the following information when submitting a bug report:

A clear and descriptive title

A detailed description of the bug, including any error messages

Steps to reproduce the bug

Expected behavior

Logs, if available, for backend issues, this is really important, you can find them in docker-compose logs

Screenshots or videos, if applicable

I have just updated the post. Please help to check it. Thank you.

Apr 01 '25 06:04 Silow9

https://github.com/langgenius/dify/blob/40fb4d16ef0f5e8dd984c205b9941df68a445b0f/api/core/rag/extractor/helpers.py#L19-L44

Based on this function, it seems your file is corrupt somehow. You can try saving it again with UTF-8 encoding to see if that works.

Apr 01 '25 08:04 crazywoola

dify/api/core/rag/extractor/helpers.py

Lines 19 to 44 in 40fb4d1

def detect_file_encodings(file_path: str, timeout: int = 5) -> list[FileEncoding]:
    """Try to detect the file encoding.

    Returns a list of `FileEncoding` tuples with the detected encodings ordered
    by confidence.

    Args:
        file_path: The path to the file to detect the encoding for.
        timeout: The timeout in seconds for the encoding detection.
    """
    import chardet

    def read_and_detect(file_path: str) -> list[dict]:
        rawdata = Path(file_path).read_bytes()
        return cast(list[dict], chardet.detect_all(rawdata))

    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(read_and_detect, file_path)
        try:
            encodings = future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            raise TimeoutError(f"Timeout reached while detecting encoding for {file_path}")

    if all(encoding["encoding"] is None for encoding in encodings):
        raise RuntimeError(f"Could not detect encoding for {file_path}")
    return [FileEncoding(**enc) for enc in encodings if enc["encoding"] is not None]
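The final filtering step of the function above can be looked at in isolation. The sketch below applies it to detection results that are already computed, so it runs without chardet. The `FileEncoding` definition and the `select_encodings` name are hypothetical stand-ins written for this example, and the sample result with `encoding` set to `None` is an assumption about what a detector might return for binary data.

```python
from typing import NamedTuple, Optional


class FileEncoding(NamedTuple):
    # Hypothetical stand-in for the FileEncoding type used in the excerpt.
    encoding: Optional[str]
    confidence: float
    language: Optional[str]


def select_encodings(results: list[dict], file_path: str) -> list[FileEncoding]:
    """Apply the excerpt's final check to pre-computed detection results."""
    # If no candidate names an encoding, give up with the same error message.
    if all(r["encoding"] is None for r in results):
        raise RuntimeError(f"Could not detect encoding for {file_path}")
    # Otherwise keep only the candidates that name an encoding.
    return [FileEncoding(**r) for r in results if r["encoding"] is not None]
```

Under that assumption, a result list such as `[{"encoding": None, "confidence": 0.0, "language": None}]` raises the same `RuntimeError` that appears in the log. This indicates only that the detector found no text encoding, which does not by itself mean the file is corrupt.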


In fact, this file is not corrupted. It is a Microsoft Word 97-2003 file. A .doc file is binary, not UTF-8 encoded text, so its bytes have to be interpreted according to the format specification. I hope Dify can support uploading and parsing this kind of file directly, so that I don't have to batch-convert my files to another format.
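A minimal sketch of the point above: Word 97-2003 files are stored in the OLE2 Compound File container, which begins with a fixed 8-byte signature whose first byte is 0xd0. That is consistent with the `byte 0xd0 in position 0` error in the log, though it does not prove what the uploaded file contained. The function names here are hypothetical and are not part of Dify.

```python
from typing import Optional

# Signature at the start of every OLE2 Compound File, the container
# format used by Word 97-2003 .doc files.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_legacy_office_file(data: bytes) -> bool:
    """Return True if the data starts with the OLE2 container signature."""
    return data.startswith(OLE2_MAGIC)


def describe_utf8_failure(data: bytes) -> Optional[str]:
    """Return a short description of the first UTF-8 decoding error, if any."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"byte 0x{data[exc.start]:02x} at position {exc.start}: {exc.reason}"
    return None
```

A check like `looks_like_legacy_office_file(Path(path).read_bytes()[:8])` could run before any text decoding is attempted, so that binary .doc files are routed to a binary-aware parser or rejected with a clearer message.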

Apr 01 '25 08:04 Silow9


I tested many different .doc files, and the error occurred every time. I suspect that Dify does not support uploading .doc files.

Apr 01 '25 08:04 Silow9