v0.15.3: When uploading documents to the knowledge base through the document API create_by_file, documents may be marked as available without being indexed
Self Checks
- [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
0.15.3
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
Just call the document API create_by_file. At first it worked, but after a while the situation described in the title appears.
✔️ Expected Behavior
The uploaded file is indexed and then becomes available.
❌ Actual Behavior
Although the status is available, the file has no segments, because it becomes available immediately after the API is called.
I found two similar unsolved discussions that might be relevant to your issue:
- Knowledge base file upload text segmentation and cleaning error [1].
- Unable to automatically perform embedding processing after uploading documents using the API? [2].
Regarding your issue, there are known limitations with the create_by_file API in version 0.15.3 that might be related to your problem. The API supports only single file uploads, and files must be 15MB or smaller. Additionally, the self-hosted Docker version does not support PPTX files, even though the official API does [3].
The Dify platform determines when a document is fully indexed by checking the document's indexing status. You can verify the indexing status using the following API call:
curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{batch}/indexing-status' \
--header 'Authorization: Bearer {api_key}'
This will provide the current status of the document indexing process. The possible indexing_status values include waiting, parsing, cleaning, splitting, indexing, paused, error, and completed [4][5].
If a document is marked as available but not indexed, it might be due to an error in the indexing process. Specific error messages such as document_indexing, unsupported_file_type, or file_too_large can indicate issues [4][6].
You might want to check the indexing status and error messages to diagnose the issue further.
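The status check above can be scripted. The following is a minimal Python sketch, not Dify's own client: it assumes the response is a JSON object with a `data` list whose entries carry `indexing_status` and `total_segments` fields, and the helper names (`classify`, `fetch_status`, `wait_for_indexing`) are hypothetical. It treats a document that reports completed with zero segments as a separate case, since that matches the symptom reported here.

```python
import json
import time
import urllib.request

# Status values taken from the list above.
IN_PROGRESS = {"waiting", "parsing", "cleaning", "splitting", "indexing"}


def classify(doc):
    """Classify one entry from the status response (field names are assumed)."""
    status = doc.get("indexing_status")
    if status == "completed":
        # "completed" with zero segments matches the symptom reported in this issue.
        return "indexed" if (doc.get("total_segments") or 0) > 0 else "empty"
    if status in IN_PROGRESS:
        return "in_progress"
    return "paused" if status == "paused" else "error"


def fetch_status(base_url, dataset_id, batch, api_key):
    """GET the indexing-status endpoint shown in the curl call above."""
    url = f"{base_url}/datasets/{dataset_id}/documents/{batch}/indexing-status"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("data", [])


def wait_for_indexing(fetch, interval=2.0, max_polls=30):
    """Poll `fetch()` until no document is still in progress; return the final states."""
    states = []
    for _ in range(max_polls):
        states = [classify(doc) for doc in fetch()]
        if states and "in_progress" not in states:
            break
        time.sleep(interval)
    return states
```

A call might look like `wait_for_indexing(lambda: fetch_status("https://api.dify.ai/v1", dataset_id, batch, api_key))`, with the base URL replaced by your own host for a self-hosted deployment. A result containing "empty" would indicate the state described in this issue.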
Does a self-hosted service also have limitations?
No, there are some bugs in this process.
You can use the browser console to capture the parameters sent when you upload a file through the web UI, and then write your API request body based on them. For example, the request body I wrote below contains parameters that are not declared in the API documentation. You will also find that the mode parameter is inconsistent with what the API specification declares.
{
  "indexing_technique": "high_quality",
  "doc_form": "hierarchical_model",
  "process_rule": {
    "mode": "hierarchical",
    "rules": {
      "parent_mode": "full-doc",
      "pre_processing_rules": [
        {
          "id": "remove_extra_spaces",
          "enabled": true
        },
        {
          "id": "remove_urls_emails",
          "enabled": false
        }
      ],
      "segmentation": {
        "separator": "\n\n",
        "max_tokens": 500
      },
      "subchunk_segmentation": {
        "separator": "\n",
        "max_tokens": 4000
      }
    }
  }
}
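For anyone who wants to send the request body above from a script, here is a rough Python sketch. It assumes the endpoint expects a multipart form with the JSON serialized into a text field named `data` next to a `file` part; check this against the API docs for your version. The helper name `build_form_parts` is hypothetical, and the returned dict follows the `files=` convention of the third-party requests library, which the sketch itself does not import.

```python
import json

# Settings copied from the request body above (captured from the web UI).
HIERARCHICAL_SETTINGS = {
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "hierarchical",
        "rules": {
            "parent_mode": "full-doc",
            "pre_processing_rules": [
                {"id": "remove_extra_spaces", "enabled": True},
                {"id": "remove_urls_emails", "enabled": False},
            ],
            "segmentation": {"separator": "\n\n", "max_tokens": 500},
            "subchunk_segmentation": {"separator": "\n", "max_tokens": 4000},
        },
    },
}


def build_form_parts(settings, filename, content):
    """Build multipart parts: a text `data` field plus the uploaded `file`.

    The field names `data` and `file` are assumptions about the endpoint.
    """
    return {
        # (None, value, content_type) marks a plain form field, not a file upload.
        "data": (None, json.dumps(settings), "text/plain"),
        "file": (filename, content),
    }
```

With requests installed, the parts could be sent with something like `requests.post(url, headers={"Authorization": f"Bearer {api_key}"}, files=build_form_parts(HIERARCHICAL_SETTINGS, "doc.txt", data))`, where `url` is the create_by_file endpoint of your own deployment.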