Dify Doc Extractor can't get anything from doc

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.13.2

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

1. In the workflow, add a Doc Extractor node. After a document is uploaded and the workflow is run, the node returns text: "".
2. Set up the Unstructured.IO service. A request sent with Apipost to http://ip/general/v0/general can parse the document. Env config:

ETL_TYPE=Unstructured
UNSTRUCTURED_API_URL=http://ip/general/v0/general
UNSTRUCTURED_API_KEY=
SCARF_NO_ANALYTICS=true

Nothing changes with this configuration: Unstructured.IO is not invoked.

✔️ Expected Behavior

The Doc Extractor should invoke Unstructured.IO and return the extracted text.

❌ Actual Behavior

No response

Dec 14 '24 01:12 kelicblan

Hello @kelicblan! I'm here to help you with any bugs, questions, or contributions you have for the project. If you need assistance, feel free to ask.

To address the issue where the Doc Extractor in Dify is not extracting text from documents, here are some troubleshooting steps and considerations:

1. Check File Type Support: Ensure that the document type you are trying to extract text from is supported by the Doc Extractor. Supported file types include plain text, PDF, DOC/DOCX, CSV, Excel, PPT/PPTX, EPUB, EML, MSG, JSON, and YAML. If the file type is unsupported, the extractor will not work ^[1].
2. Verify Unstructured.IO Configuration: Since you are using the Unstructured.IO service, ensure that the environment variables are correctly set:
   - ETL_TYPE should be set to Unstructured.
   - UNSTRUCTURED_API_URL should point to the correct API endpoint, e.g., http://ip/general/v0/general.
   - Ensure that the Unstructured.IO service is running and accessible from your Dify instance ^[2].
3. Check for Errors in Logs: Look at the logs for any errors or warnings that might indicate why the Unstructured.IO service is not being invoked. This can provide clues if there are connectivity issues or misconfigurations.
4. Test Unstructured.IO Independently: Since you mentioned that the Unstructured.IO service can identify the document when tested independently, ensure that the same configuration is applied within the Dify environment.
5. Ensure Correct Variable Types: Make sure that the input variables to the Doc Extractor are of the correct type (ArrayFileSegment or FileSegment). Incorrect variable types can lead to failures in processing ^[1].

By following these steps, you should be able to identify and resolve the issue with the Doc Extractor not invoking the Unstructured.IO service.
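
The configuration and connectivity checks above could be scripted roughly as follows. This is only a sketch using the Python standard library, not code from Dify itself: the helper names `check_etl_env`, `build_multipart`, and `probe_unstructured` are hypothetical, the exact-match test on `ETL_TYPE` is an assumption based on the value shown in the report, and the `files` form field and `unstructured-api-key` header are assumed to be what the Unstructured endpoint expects. Running it from the same machine or container as the Dify API process shows whether that environment sees the variables and can reach the endpoint.

```python
import os
import urllib.request
import uuid
from urllib.parse import urlparse


def check_etl_env(env):
    """Return a list of problems found in the ETL-related settings."""
    problems = []
    # Assumption: the value must match the one in the report exactly.
    if env.get("ETL_TYPE") != "Unstructured":
        problems.append("ETL_TYPE is not set to 'Unstructured'")
    parsed = urlparse(env.get("UNSTRUCTURED_API_URL", ""))
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("UNSTRUCTURED_API_URL is missing or not an http(s) URL")
    return problems


def build_multipart(filename, payload):
    """Encode one file as multipart/form-data under the 'files' field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def probe_unstructured(url, path, api_key=""):
    """POST a document to the endpoint and return the raw response text."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(os.path.basename(path), fh.read())
    headers = {"Content-Type": content_type, "Accept": "application/json"}
    if api_key:
        # Assumed header name for the API key; omitted when the key is empty.
        headers["unstructured-api-key"] = api_key
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Report configuration problems visible to this process.
    for problem in check_etl_env(os.environ):
        print("config problem:", problem)
```

If `check_etl_env` reports nothing and `probe_unstructured(os.environ["UNSTRUCTURED_API_URL"], "sample.docx")` returns parsed elements from inside the Dify environment (`sample.docx` being a placeholder path), the configuration and network path are probably fine and the problem lies elsewhere.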

To continue talking to Dosu, mention @dosu.


Dec 14 '24 01:12 dosubot[bot]

If the document contains only images, the extractor cannot extract anything from it, and you will get "" as expected.

Dec 16 '24 01:12 crazywoola

@crazywoola Hi, I can extract text from images and preview it, but when I click "Save & Process", it turns into zero chunks. What is the reason? Is the content used for the preview different from the content used by "Save & Process"?

Dec 17 '24 10:12 spytensor

@crazywoola Hi, I can extract text from images and preview it, but when I click "Save & Process", it turns into zero chunks. What is the reason? Is the content used for the preview different from the content used by "Save & Process"?

They may use different program logic for the embedding. If you use images for embedding, there should be something like an image_extractor; otherwise the default extractor is used. Anyway, this is off topic.

Jan 13 '25 02:01 crazywoola