v0.15.3: When uploading documents to the knowledge base through the document API create_by_file, documents may be marked as available without having been indexed

Open ccooder opened this issue 9 months ago • 4 comments

Self Checks

  • [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [x] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • [x] Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Call the document API create_by_file. At first it works, but after a while the situation described in the title appears.

✔️ Expected Behavior

The uploaded file is indexed and then becomes available.

❌ Actual Behavior

[Image] Although the status is available, the file has no segments, because it becomes available immediately after the API is called. [Image]

ccooder avatar Mar 07 '25 02:03 ccooder

I found two similar unsolved discussions that might be relevant to your issue:

  1. Knowledge base file upload text segmentation and cleaning error [1].
  2. Unable to automatically perform embedding processing after uploading documents using the API? [2].

Regarding your issue, there are known limitations with the create_by_file API in version 0.15.3 that might be related to your problem. The API supports only single file uploads, and files must be 15MB or smaller. Additionally, the self-hosted Docker version does not support PPTX files, even though the official API does [3].

The Dify platform determines when a document is fully indexed by checking the document's indexing status. You can verify the indexing status using the following API call:

curl --location --request GET 'https://api.dify.ai/v1/datasets/{dataset_id}/documents/{batch}/indexing-status' \
--header 'Authorization: Bearer {api_key}'

This will provide the current status of the document indexing process. The possible indexing_status values include waiting, parsing, cleaning, splitting, indexing, paused, error, and completed [4][5].

If a document is marked as available but not indexed, it might be due to an error in the indexing process. Specific error messages such as document_indexing, unsupported_file_type, or file_too_large can indicate issues [4][6].

You might want to check the indexing status and error messages to diagnose the issue further.
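
The status check above can be turned into a small polling loop. The following is a minimal sketch, not Dify's own client: it assumes the response body has the shape {"data": [{"indexing_status": ...}]}, and the function names, the polling interval, and the choice to stop on paused are illustrative.

```python
import json
import time
import urllib.request

# Statuses after which polling stops. "paused" is included because the
# document will not progress until it is resumed (an assumption of this sketch).
TERMINAL_STATUSES = {"completed", "error", "paused"}


def fetch_indexing_status(base_url, dataset_id, batch, api_key):
    """GET the indexing-status endpoint from the curl call and parse the JSON."""
    url = f"{base_url}/datasets/{dataset_id}/documents/{batch}/indexing-status"
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_indexing(fetch, interval_seconds=2.0, max_polls=30):
    """Call fetch() until every document reports a terminal status.

    Assumes the payload looks like {"data": [{"indexing_status": "..."}]}.
    Returns the list of final statuses, or raises TimeoutError.
    """
    statuses = []
    for _ in range(max_polls):
        payload = fetch()
        statuses = [doc.get("indexing_status") for doc in payload.get("data", [])]
        if statuses and all(status in TERMINAL_STATUSES for status in statuses):
            return statuses
        time.sleep(interval_seconds)
    raise TimeoutError(f"indexing did not finish; last statuses: {statuses}")
```

A caller would pass something like `lambda: fetch_indexing_status("https://api.dify.ai/v1", dataset_id, batch, api_key)` as `fetch`, replacing the base URL with the self-hosted one. Taking `fetch` as a parameter keeps the loop testable without a running server.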

dosubot[bot] avatar Mar 07 '25 02:03 dosubot[bot]

Does a self-hosted service also have limitations?

ccooder avatar Mar 07 '25 03:03 ccooder

Does a self-hosted service also have limitations?

No, there are some bugs in this process.

crazywoola avatar Mar 11 '25 02:03 crazywoola

You can use the browser console to capture the parameters sent when you upload files through the web UI, and then write your API request body from them. For example, the request body I wrote below contains parameters that are not declared in the API documentation. You will also find that the mode parameter is inconsistent with what the API specification declares.

{
  "indexing_technique": "high_quality",
  "doc_form": "hierarchical_model",
  "process_rule": {
    "mode": "hierarchical",
    "rules": {
      "parent_mode": "full-doc",
      "pre_processing_rules": [
        {
          "id": "remove_extra_spaces",
          "enabled": true
        },
        {
          "id": "remove_urls_emails",
          "enabled": false
        }
      ],
      "segmentation": {
        "separator": "\n\n",
        "max_tokens": 500
      },
      "subchunk_segmentation": {
        "separator": "\n",
        "max_tokens": 4000
      }
    }
  }
}
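
One way to send a body like this from a script is sketched below. Treat it as an assumption-laden example rather than the documented API: it assumes the endpoint accepts a multipart form with the JSON as a string in a field named data and the upload in a field named file, and that the path is /datasets/{dataset_id}/document/create_by_file. The helper names are hypothetical, and the settings are an abbreviated copy of the body above.

```python
import json
import urllib.request
import uuid

# Abbreviated copy of the request body captured from the web UI.
SETTINGS = {
    "indexing_technique": "high_quality",
    "doc_form": "hierarchical_model",
    "process_rule": {
        "mode": "hierarchical",
        "rules": {
            "parent_mode": "full-doc",
            "segmentation": {"separator": "\n\n", "max_tokens": 500},
            "subchunk_segmentation": {"separator": "\n", "max_tokens": 4000},
        },
    },
}


def build_create_by_file_form(settings, filename, file_bytes):
    """Encode the settings and the file as one multipart/form-data body.

    Returns (content_type_header, body_bytes). The field names "data" and
    "file" are assumptions of this sketch.
    """
    boundary = uuid.uuid4().hex
    data_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="data"\r\n'
        "Content-Type: text/plain\r\n\r\n"
        f"{json.dumps(settings)}\r\n"
    ).encode("utf-8")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = data_part + file_header + file_bytes + closing
    return f"multipart/form-data; boundary={boundary}", body


def upload_document(base_url, dataset_id, api_key, filename, file_bytes):
    """POST the form to the (assumed) create_by_file path and parse the reply."""
    content_type, body = build_create_by_file_form(SETTINGS, filename, file_bytes)
    request = urllib.request.Request(
        f"{base_url}/datasets/{dataset_id}/document/create_by_file",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Comparing the form this produces with the request the browser sends is a quick way to spot parameters, such as mode, that differ from the published specification.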

disTance7777 avatar Mar 28 '25 09:03 disTance7777