retrieve API cannot run in thread mode

Open kksasa opened this issue 1 year ago • 2 comments

Self Checks

  • [X] This is only for bug report, if you would like to ask a question, please head to Discussions.
  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • [X] Please do not modify this template :) and fill in all the required fields.

Dify version

Version 0.11.2

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

It seems the retrieve API does not benefit from threads. With the code below, retrieval takes around 20s whether or not I use thread mode.

import threading

# Endpoint called once per dataset by retrieve_documents (not shown here).
url = f'http://10.233.40.131/v1/datasets/{dataset_id}/retrieve'

def dify_retrieve(query, dataset_names: list, top_k=10, score_threshold=0.4, use_thread=True):
    threads_search = []
    results: list[Document] = []
    if use_thread:
        for dataset_name in dataset_names:
            # Pass dataset_name through args so each thread keeps its own value;
            # a bare lambda would read the loop variable only when it runs.
            thread = threading.Thread(
                target=lambda name: results.extend(retrieve_documents(name, query, top_k, score_threshold)),
                args=(dataset_name,),
            )
            threads_search.append(thread)
            thread.start()
        # Wait for every request to finish before returning.
        for thread in threads_search:
            thread.join()
    else:
        for dataset_name in dataset_names:
            results.extend(retrieve_documents(dataset_name, query, top_k, score_threshold))
    return results

### ✔️ Expected Behavior

Retrieval across several datasets should be faster with threads than sequentially.

### ❌ Actual Behavior

Retrieval across multiple datasets is too slow, and threads make no difference.
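One way to narrow this down is to check that the threading pattern itself overlaps waiting time when the server is taken out of the picture. The sketch below is only an illustration: `fake_retrieve` is a hypothetical stand-in for `retrieve_documents` that sleeps instead of calling Dify, and the dataset names are made up.

```python
import threading
import time


def fake_retrieve(dataset_name, query, delay=0.2):
    """Hypothetical stand-in for retrieve_documents: sleeps to mimic a slow HTTP call."""
    time.sleep(delay)
    return [f"{dataset_name}:{query}"]


def run_sequential(dataset_names, query):
    results = []
    for name in dataset_names:
        results.extend(fake_retrieve(name, query))
    return results


def run_threaded(dataset_names, query):
    results = []
    # Bind each name as a default argument so every thread keeps its own value.
    threads = [
        threading.Thread(target=lambda n=name: results.extend(fake_retrieve(n, query)))
        for name in dataset_names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start


names = ["ds1", "ds2", "ds3", "ds4"]  # made-up dataset names
seq_results, seq_time = timed(run_sequential, names, "test")
thr_results, thr_time = timed(run_threaded, names, "test")
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

If the threaded run is clearly faster with the stub but not against the real endpoint, the client-side threading is probably fine and the requests are more likely being serialized or slowed down on the server side.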

kksasa avatar Dec 11 '24 08:12 kksasa

The retrieve API running slowly in thread mode could be due to several factors:

  1. Thread Overhead: Creating a new thread for each dataset can lead to significant overhead, especially if there are many datasets. This overhead comes from managing multiple threads and context switching [1].

  2. Database Access: Each thread might be accessing the database simultaneously, which can cause contention and slow down the retrieval process if the database queries are not optimized [1].

  3. Synchronous Thread Joining: The use of thread.join() means that the main thread waits for all threads to complete. If any thread takes longer due to heavy computation or slow I/O, it will delay the entire process [1].

To improve performance, consider the following optimizations:

  • Reduce Thread Count: Limit the number of threads to a reasonable number to reduce overhead.
  • Optimize Database Queries: Ensure that database queries are efficient and indexed properly.
  • Use Asynchronous I/O: Instead of threads, consider using asynchronous I/O operations which can be more efficient for I/O-bound tasks.

These changes might help in achieving better performance when using the retrieve API in thread mode.
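As a rough sketch of the "reduce thread count" suggestion, a bounded pool from the standard library caps how many requests are in flight at once. Everything named here is an assumption for illustration: `retrieve_all`, `demo_fetch`, and the `max_workers` value are not part of Dify, and `fetch` stands in for whatever function actually calls the retrieve endpoint.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def retrieve_all(dataset_names, query, fetch, max_workers=4):
    """Call fetch(dataset_name, query) for each dataset on a bounded thread pool.

    Returns (results, errors): a flat list of documents and a dict mapping
    each failed dataset name to the exception it raised.
    """
    results = []
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, name, query): name for name in dataset_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results.extend(future.result())
            except Exception as exc:  # keep going if one dataset fails
                errors[name] = exc
    return results, errors


def demo_fetch(dataset_name, query):
    """Hypothetical fetcher used only to exercise retrieve_all."""
    if dataset_name == "broken":
        raise RuntimeError("simulated failure")
    return [f"{dataset_name}:{query}"]


docs, failed = retrieve_all(["a", "b", "broken"], "test", demo_fetch, max_workers=2)
print(sorted(docs), list(failed))
```

Collecting results from the futures in the calling thread also avoids having several worker threads append to one shared list.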

dosubot[bot] avatar Dec 11 '24 08:12 dosubot[bot]

Hi

Thanks for the answer. I think the bottleneck is multiple API requests hitting the server at the same time. So I wonder whether Dify could provide a new retrieve API that accepts a list of dataset IDs; the existing API can only query a single dataset per request.

e.g.

curl --location --request POST 'http://xxx/v1/datasets/retrieve' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
    "query": "test",
    "retrieval_model": {
        "dataset_ids": ["dataset_id1", "dataset_id2", ....],    -->>>> pass dataset ids here
        "search_method": "keyword_search",
        "reranking_enable": false,
        "reranking_mode": null,
        "reranking_model": {
            "reranking_provider_name": "",
            "reranking_model_name": ""
        },
        "weights": null,
        "top_k": 1,
        "score_threshold_enabled": false,
        "score_threshold": null
    }
}'
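Until something like this exists, the combining step can be approximated on the client once the per-dataset responses are in. The sketch below is an assumption-heavy illustration: `merge_top_k` is a hypothetical helper, and the record shape (a dict with a numeric `score` key) is assumed rather than taken from the Dify response schema.

```python
import heapq


def merge_top_k(per_dataset_records, top_k=10, score_threshold=None):
    """Merge several per-dataset record lists into one list ranked by score.

    Each record is assumed to be a dict with a numeric "score" key.
    """
    merged = [
        rec
        for records in per_dataset_records
        for rec in records
        if score_threshold is None or rec["score"] >= score_threshold
    ]
    # nlargest returns the records sorted from highest to lowest score.
    return heapq.nlargest(top_k, merged, key=lambda rec: rec["score"])


# Made-up records standing in for two single-dataset responses.
records_a = [
    {"dataset": "a", "content": "x", "score": 0.9},
    {"dataset": "a", "content": "y", "score": 0.3},
]
records_b = [{"dataset": "b", "content": "z", "score": 0.7}]

top = merge_top_k([records_a, records_b], top_k=2, score_threshold=0.4)
print([rec["content"] for rec in top])
```

This only addresses how results are combined, not the latency of the individual requests, and scores from different datasets or search methods may not be directly comparable.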

kksasa avatar Dec 16 '24 02:12 kksasa

Hi, @kksasa. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • Performance inefficiency reported in the retrieve API of Dify version 0.11.2 when self-hosted using Docker.
  • Threading does not improve speed as expected; potential causes include thread overhead and database access contention.
  • Suggested optimizations include reducing thread count, optimizing database queries, and using asynchronous I/O.
  • You proposed a new API feature to pass multiple dataset IDs in a single request to reduce simultaneous API requests.

Next Steps:

  • Please confirm if this issue is still relevant to the latest version of the Dify repository. If so, you can keep the discussion open by commenting on the issue.
  • If there is no further activity, the issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot[bot] avatar Jan 21 '25 16:01 dosubot[bot]