Unstructured API Calls when Dify selected as ETL
Self Checks
- [X] This is only for bug reports; if you would like to ask a question, please head to Discussions.
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [X] Please do not modify this template :) and fill in all the required fields.
Dify version
0.11.1
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
Self-host using Docker Compose with ETL_TYPE set to "dify". No uploaded knowledge, no active flows.
✔️ Expected Behavior
Nothing - I wouldn't expect Dify to interact with the unstructured.io URL at all.
❌ Actual Behavior
Every 12-13 seconds, my local DNS server logs two DNS lookups from the server to unstructured.io. This continues whether or not an Unstructured API URL is set.
I couldn't find specific information in the available data about why DNS lookups to unstructured.io occur when ETL_TYPE is set to "dify" in Dify version 0.11.1. This behavior might be due to a configuration issue or a bug in the application. It would be beneficial to check the Dify GitHub repository for any related issues or updates that might address this unexpected behavior.
To continue talking to Dosu, mention @dosu.
Checking the logs, it looks like my worker is crashing, and when it reboots, it calls unstructured.io. Why does the worker make an unstructured.io call when unstructured is unused?
The Dify codebase includes logging and error handling that might help diagnose why the worker is making calls to unstructured.io. The AdvancedChatAppGenerator class logs exceptions during the generation process, which could provide insights into the worker's behavior. Additionally, Dify integrates with Sentry for error tracking, which can capture and report errors, potentially shedding light on the reasons behind the worker's actions, including any calls to external services like unstructured.io [1][2].
If the worker is crashing and then making these calls upon reboot, it might be worth investigating the logs and Sentry reports for any errors or exceptions that occur just before or after the crash. This could help identify if there's a specific part of the code or a configuration that inadvertently triggers these calls.
This is because the unstructured.io package enables telemetry by default.
Even when self-hosting with ETL_TYPE=Unstructured, it will make network requests.
There is an environment variable to disable telemetry. Consider adding it to the API configuration:
environment:
  SCARF_NO_ANALYTICS: 'true'
Ref: https://docs.unstructured.io/api-reference/api-services/free-api#telemetry
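To confirm the variable actually reaches the container after restarting the stack, a quick check can be run with the Python interpreter inside the API or worker container. This is only a minimal sketch: the helper name `telemetry_opted_out` is hypothetical, and the strict match on the lowercase string "true" is an assumption about how the value is read, not something confirmed in this thread.

```python
import os


def telemetry_opted_out(env=None):
    """Return True if SCARF_NO_ANALYTICS is set to the string "true".

    Hypothetical helper for illustration only. The exact lowercase
    comparison is an assumption about how the opt-out is evaluated.
    """
    env = os.environ if env is None else env
    return env.get("SCARF_NO_ANALYTICS", "") == "true"


if __name__ == "__main__":
    # Run inside the container to see what the process environment holds.
    print("SCARF_NO_ANALYTICS opt-out set:", telemetry_opted_out())
```

If this reports that the opt-out is not set, the variable is probably not being passed to that service in the Compose file.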
ref #11179 Fixed.
Thanks Zolgear