Old-format .doc files cannot be encoded and uploaded to the knowledge base
Self Checks
- [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
1.0
Cloud or Self Hosted
Self Hosted (Docker), Cloud
Steps to reproduce
Step 1: Open the knowledge base
Step 2: Click Preview Chunk
I tested the file on both the official cloud site and a local open-source (self-hosted) deployment. The error occurs every time.
✔️ Expected Behavior
No response
❌ Actual Behavior
No response
Error message:
2025-04-01 06:13:13.270 INFO [MainThread] [strategy.py:161] - Task tasks.duplicate_document_indexing_task.duplicate_document_indexing_task[126653de-ef83-4062-bcf0-c1dc7576fd4c] received
2025-04-01 06:13:13.282 INFO [Dummy-12] [duplicate_document_indexing_task.py:60] - Start process document: bdb9d202-a4c6-4f48-9088-aa55526c998d
2025-04-01 06:13:16.550 ERROR [Dummy-12] [indexing_runner.py:96] - consume document failed
Traceback (most recent call last):
File "/app/api/core/rag/extractor/text_extractor.py", line 29, in extract
text = Path(self._file_path).read_text(encoding=self._encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/pathlib.py", line 1028, in read_text
return f.read()
^^^^^^^^
File "
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/api/core/indexing_runner.py", line 70, in run
text_docs = self._extract(index_processor, dataset_document, processing_rule.to_dict())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/api/core/indexing_runner.py", line 351, in _extract
text_docs = index_processor.extract(extract_setting, process_rule_mode=process_rule["mode"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/api/core/rag/index_processor/processor/paragraph_index_processor.py", line 22, in extract
text_docs = ExtractProcessor.extract(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/api/core/rag/extractor/extract_processor.py", line 161, in extract
return extractor.extract()
^^^^^^^^^^^^^^^^^^^
File "/app/api/core/rag/extractor/text_extractor.py", line 32, in extract
detected_encodings = detect_file_encodings(self._file_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/api/core/rag/extractor/helpers.py", line 43, in detect_file_encodings
raise RuntimeError(f"Could not detect encoding for {file_path}")
RuntimeError: Could not detect encoding for /tmp/tmp8z_ndsit/_h5h_hyx.doc
2025-04-01 06:13:16.557 INFO [Dummy-12] [duplicate_document_indexing_task.py:92] - Processed dataset: dccabf81-8aeb-471d-a11c-58a03cc95c2b latency: 3.285501961945556
2025-04-01 06:13:16.576 INFO [Dummy-12] [trace.py:128] - Task tasks.duplicate_document_indexing_task.duplicate_document_indexing_task[126653de-ef83-4062-bcf0-c1dc7576fd4c] succeeded in 3.3045183670474216s: None
Hey @Silow9! 👋 I'm here to help you with bugs, questions, and becoming a contributor. Let's squash those pesky bugs together!
The issue you're encountering with encoding and uploading older .doc files in Dify might be related to the requirement of using the Unstructured API for processing such files. Dify typically handles DOCX files directly, but older .doc files may need additional configurations, such as setting up UNSTRUCTURED_API_URL and UNSTRUCTURED_API_KEY. If these configurations are not set correctly, or if ETL_TYPE is set to 'dify', you might experience issues, as these settings should not be necessary in that case [1].
Additionally, ensure that you are using the Doc Extractor feature for document processing and extraction, as this can resolve issues where the system indicates that no document was uploaded [2]. If the problem persists, it might be a bug or misconfiguration in your setup.
https://github.com/langgenius/dify/blob/0277a37fcad5ad86aeb239485c27fffd5cd90043/CONTRIBUTING.md#L23-L33
Please provide detailed steps to reproduce this error. Thanks.
Lines 23 to 33 in 0277a37
Bug reports
[!IMPORTANT] Please make sure to include the following information when submitting a bug report:
- A clear and descriptive title
- A detailed description of the bug, including any error messages
- Steps to reproduce the bug
- Expected behavior
- Logs, if available, for backend issues, this is really important, you can find them in docker-compose logs
- Screenshots or videos, if applicable
I have just updated the post. Please help to check it. Thank you.
https://github.com/langgenius/dify/blob/40fb4d16ef0f5e8dd984c205b9941df68a445b0f/api/core/rag/extractor/helpers.py#L19-L44
Based on this function, it seems your file is somehow corrupt. You can try saving it again with UTF-8 encoding to see if that works.
dify/api/core/rag/extractor/helpers.py
Lines 19 to 44 in 40fb4d1
def detect_file_encodings(file_path: str, timeout: int = 5) -> list[FileEncoding]:
    """Try to detect the file encoding.

    Returns a list of `FileEncoding` tuples with the detected encodings ordered
    by confidence.

    Args:
        file_path: The path to the file to detect the encoding for.
        timeout: The timeout in seconds for the encoding detection.
    """
    import chardet

    def read_and_detect(file_path: str) -> list[dict]:
        rawdata = Path(file_path).read_bytes()
        return cast(list[dict], chardet.detect_all(rawdata))

    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(read_and_detect, file_path)
        try:
            encodings = future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            raise TimeoutError(f"Timeout reached while detecting encoding for {file_path}")

    if all(encoding["encoding"] is None for encoding in encodings):
        raise RuntimeError(f"Could not detect encoding for {file_path}")
    return [FileEncoding(**enc) for enc in encodings if enc["encoding"] is not None]
In fact, this file is not corrupted. It is a Microsoft Word 97-2003 file. A .doc file is binary, not UTF-8 encoded text, so its bytes have to be interpreted according to the format specification rather than decoded as text. I hope Dify can support uploading and parsing this kind of file directly, instead of requiring me to batch-convert my files to another format.
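To illustrate the point, here is a minimal standalone sketch (not Dify's code) of how a legacy .doc can be told apart from a text file before any encoding detection is attempted. Word 97-2003 files are stored in an OLE2 Compound File container, which begins with the bytes D0 CF 11 E0 A1 B1 1A E1. The helper names `looks_like_ole2` and `decodes_as_utf8` and the sample file are hypothetical, introduced only for this example.

```python
import tempfile
from pathlib import Path

# Signature that opens an OLE2 / Compound File Binary container,
# the container format used by Word 97-2003 .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(path: str) -> bool:
    """Return True if the file starts with the OLE2 container signature."""
    with open(path, "rb") as fh:
        return fh.read(len(OLE2_MAGIC)) == OLE2_MAGIC


def decodes_as_utf8(path: str) -> bool:
    """Return True if the whole file is valid UTF-8 text."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Hypothetical stand-in for a legacy .doc: only the header bytes are
    # realistic, the rest is zero padding.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "legacy.doc"
        sample.write_bytes(OLE2_MAGIC + b"\x00" * 504)
        print("OLE2 container:", looks_like_ole2(str(sample)))
        print("valid UTF-8:", decodes_as_utf8(str(sample)))
```

A check like this could, in principle, let an extractor route such files to a binary .doc parser instead of treating the failure as a corrupt text file. Whether Dify should do that, and where, is a design decision for the maintainers.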
I tested many different .doc files, and the error occurred every time. I suspect that Dify does not currently support uploading .doc files.