dify How do I know that unstructured is working as intended?

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.91

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

In the environment file, set the following configuration:

VECTOR_STORE=weaviate
COMPOSE_PROFILES=${VECTOR_STORE:-weaviate},unstructured
ETL_TYPE=Unstructured
UNSTRUCTURED_API_URL=http://unstructured:8000/general/v0/general

Upload a document, such as a .docx file.

Observe that the file indexes successfully.

Check the logs of the unstructured container and note that there is no output.

Inspect the storage for unstructured data and find it empty.

The above steps suggest that the unstructured container may not be functioning correctly, even though ETL_TYPE is set to Unstructured.

✔️ Expected Behavior

I think files should be written to the unstructured volume.

❌ Actual Behavior

I have no clue what to expect when unstructured really is working as intended.

Oct 03 '24 14:10 spencerwongfeilong

@spencerwongfeilong Hello there! I'm here to help you with any bugs, questions, or contributions you have. Let's work together to resolve any issues you're facing.

To verify that the unstructured container is working as intended in Dify, you can check the indexing status of documents using the endpoint /datasets/{dataset_id}/documents/{batch}/indexing-status. This endpoint provides information on the embedding status of the document, including fields like indexing_status, completed_segments, and total_segments, which indicate the progress of the indexing process ^[1].
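A minimal Python sketch of polling that endpoint is shown here. The path and the three field names come from the paragraph above; the base URL, the Bearer-token header, and the "data" wrapper around the response are assumptions about a typical deployment, not confirmed in this thread.

```python
import json
import urllib.request


def indexing_status_url(base_url: str, dataset_id: str, batch: str) -> str:
    """Build the indexing-status URL from the path quoted above."""
    root = base_url.rstrip("/")
    return f"{root}/datasets/{dataset_id}/documents/{batch}/indexing-status"


def summarize(entry: dict) -> str:
    """Render one status entry as '<status>: <completed>/<total> segments'."""
    status = entry.get("indexing_status", "unknown")
    done = entry.get("completed_segments", 0)
    total = entry.get("total_segments", 0)
    return f"{status}: {done}/{total} segments"


def fetch_status(base_url: str, api_key: str, dataset_id: str, batch: str,
                 timeout: float = 10.0) -> list:
    """Fetch the status entries for one upload batch.

    Assumptions: Bearer authentication and a top-level "data" list.
    """
    request = urllib.request.Request(
        indexing_status_url(base_url, dataset_id, batch),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("data", [])
```

Something like `for entry in fetch_status(base, key, dataset_id, batch): print(summarize(entry))` would print one line per document. These fields report indexing progress only, so on their own they do not show which extractor processed the file.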

If the unstructured container is not producing any logs or writing files to the unstructured volume, it might indicate a configuration issue. Ensure that the ETL_TYPE is set to Unstructured and the UNSTRUCTURED_API_URL is correctly configured in your environment file ^[2]. Additionally, verify that the storage configuration is correct, especially if you are using local storage. The STORAGE_LOCAL_PATH should be set appropriately, and the directory should be mounted correctly in your Docker setup ^[3].

If you continue to experience issues, reviewing the configuration and ensuring all environment variables are set correctly would be a good next step.
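One way to test the container independently of Dify is to post a small file straight to the configured UNSTRUCTURED_API_URL. The sketch below assumes the endpoint accepts a multipart upload under the form field "files" and an optional "unstructured-api-key" header; the helper names are made up for illustration.

```python
import json
import mimetypes
import urllib.request
import uuid


def build_multipart(filename: str, content: bytes, field: str = "files"):
    """Encode a single file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def probe_unstructured(api_url: str, filename: str, content: bytes,
                       api_key: str = "", timeout: float = 60.0):
    """POST one file to the Unstructured endpoint and return the parsed JSON."""
    body, content_type = build_multipart(filename, content)
    headers = {"Content-Type": content_type, "Accept": "application/json"}
    if api_key:
        # Assumed header name for the API key; leave api_key empty to omit it.
        headers["unstructured-api-key"] = api_key
    request = urllib.request.Request(api_url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

The hostname unstructured normally resolves only inside the Compose network, so a call such as `probe_unstructured("http://unstructured:8000/general/v0/general", "hello.txt", b"Hello world")` would have to run from another container on that network. If the service is healthy, the request would be expected to return parsed elements and to appear in the unstructured container's logs.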

Oct 03 '24 14:10 dosubot[bot]

If you were extracting .docx files, then the Unstructured.io container is not used.

Instead, the default Dify extractor is used.

The following file will explain which extractor will be used for which file type:

api/core/rag/extractor/extract_processor.py
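As a rough illustration of what that means in practice, the sketch below encodes only what is reported in this thread: .docx files, and per a later comment .pdf files, go to Dify's built-in extractors even with ETL_TYPE=Unstructured. The function and constant names are hypothetical, the set is not copied from Dify's source, and every other extension has to be looked up in extract_processor.py.

```python
from pathlib import Path

# Extensions this thread reports as handled by Dify's built-in extractors
# even when ETL_TYPE=Unstructured. NOT taken from Dify's source code.
BUILTIN_PER_THREAD = {".docx", ".pdf"}


def expected_route(filename: str, etl_type: str) -> str:
    """Guess whether a file would reach the Unstructured container."""
    extension = Path(filename).suffix.lower()
    if etl_type != "Unstructured":
        # Assumption: with any other ETL_TYPE the container is not involved.
        return "builtin"
    if extension in BUILTIN_PER_THREAD:
        return "builtin"
    # Not covered by this thread; the real answer is in the file above.
    return "check extract_processor.py"
```

For example, `expected_route("report.docx", "Unstructured")` returns "builtin", which matches the empty container logs described in the original report.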

Oct 04 '24 01:10 taowang1993

Hello, thanks for the response.

My understanding is that self-hosted Dify does not support the Unstructured serverless API. Is that right?

Also, the env file doesn't mention an Unstructured API key variable. Is that correct? I'm unsure whether an API key is needed to use the Docker unstructured container.

Could you please clarify?

Oct 07 '24 09:10 spencerwongfeilong

You need to manually add the API key variable to your docker/.env file as follows:

ETL_TYPE=Unstructured
UNSTRUCTURED_API_URL=
UNSTRUCTURED_API_KEY=

If you are using the paid unstructured api, you need to use the corresponding URL and API key.
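A small sketch for sanity-checking those three variables before restarting the stack is shown here. It uses a deliberately simple KEY=VALUE parser that ignores comments and does not expand ${...} references; the function names are illustrative only.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_unstructured_env(values: dict) -> list:
    """Return human-readable findings about the Unstructured settings."""
    findings = []
    if values.get("ETL_TYPE") != "Unstructured":
        findings.append("error: ETL_TYPE is not 'Unstructured'")
    if not values.get("UNSTRUCTURED_API_URL"):
        findings.append("error: UNSTRUCTURED_API_URL is empty")
    if not values.get("UNSTRUCTURED_API_KEY"):
        # Only a note: the thread ties the key to the paid API.
        findings.append("note: UNSTRUCTURED_API_KEY is empty")
    return findings
```

Running `check_unstructured_env(parse_env(open("docker/.env").read()))` would list anything missing. An empty key is reported as a note rather than an error, since the reply above only requires it for the paid API.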

Oct 07 '24 20:10 taowang1993

It seems that Dify always uses its default PdfExtractor rather than Unstructured's PDF extraction (with its data cleaning feature?), even when configured with ETL_TYPE=Unstructured. Is this a bug?

https://github.com/langgenius/dify/blob/bf048b8d7c709035f23cbb6bd20bc772bbb0d766/api/core/rag/extractor/extract_processor.py#L101-L105

Nov 01 '24 08:11 Wsine