
When uploading many texts (>30,000) to a knowledge base via the API, it takes too long for them to become ready.

Open shawn-maxiao opened this issue 10 months ago • 9 comments

Self Checks

  • [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [x] [FOR CHINESE USERS] Please be sure to submit issues in English, or they will be closed. Thank you! :)
  • [x] Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

I have read the source to find out why. Dify spends most of this time extracting keywords: the extract-keywords function uses Jieba to extract keywords and then saves the keyword table. The save step is very slow (about 2~5 seconds to save the dataset keyword table for a single document chunk):

```python
self._save_dataset_keyword_table(keyword_table)
```

Another big problem is a Redis lock, which makes every other extract-keywords thread wait for the one currently executing, so the process runs one-by-one instead of in parallel:

```python
lock_name = "keyword_indexing_lock_{}".format(self.dataset.id)
with redis_client.lock(lock_name, timeout=600):
    ...
```
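The serialization effect is easy to reproduce in isolation. A minimal sketch, with a `threading.Lock` standing in for the Redis lock and a short sleep standing in for the 2~5 second extract-and-save:

```python
import threading
import time

lock = threading.Lock()  # stand-in for redis_client.lock(...)
done = []

def index_chunk(i):
    # Every worker queues on the same dataset-wide lock, so the
    # "parallel" workers actually run strictly one after another.
    with lock:
        time.sleep(0.01)  # stand-in for extract + save (2~5 s in practice)
        done.append(i)

threads = [threading.Thread(target=index_chunk, args=(i,)) for i in range(5)]
start = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start  # at least 5 * 0.01 s: fully serialized
```

With 30,000+ chunks at 2~5 seconds each, this one-at-a-time behavior alone accounts for many hours of indexing time.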

How to solve it?

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

shawn-maxiao avatar Feb 01 '25 15:02 shawn-maxiao

@shawn-maxiao Hello there! I'm here to assist you with any bugs, questions, or contributions. Let's tackle this issue together while we wait for a human maintainer.

To address the performance issues you're experiencing with the keyword extraction function in Dify, especially when dealing with a large number of texts, you can consider the following optimizations:

  1. Remove the Redis Lock for Parallel Processing: The Redis lock serializes processing, which is the main bottleneck. Removing it enables parallel processing of texts, for example by handling each text in a separate thread or process. Note that the lock guards the shared keyword table, so a parallel version must merge results in a single thread (or otherwise synchronize writes) to stay race-free.

  2. Batch Processing: Instead of processing each text individually, you can batch multiple texts together and process them in a single operation. This reduces the overhead of acquiring and releasing locks and can improve the efficiency of keyword extraction.

  3. Optimize Jieba Usage: Consider tuning the keyword extraction itself, for example reducing max_keywords_per_chunk so fewer keywords are extracted and stored per chunk, or switching to a lighter extraction algorithm.

  4. Asynchronous Processing: Implement asynchronous processing to handle keyword extraction, which can help utilize the CPU more efficiently and lead to better performance.
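The batching idea in points 2 and 4 can be sketched in isolation. Everything below is hypothetical: `extract_keywords` is a cheap Counter-based stand-in for Dify's Jieba handler, and `index_batch` just shows the shape of "merge in memory, save once":

```python
from collections import Counter

def extract_keywords(text: str, top_k: int = 5) -> list[str]:
    # Hypothetical stand-in for a Jieba-based extractor.
    words = [w.strip(".,") for w in text.lower().split() if len(w) > 3]
    return [w for w, _ in Counter(words).most_common(top_k)]

def index_batch(chunks: dict[str, str]) -> dict[str, set[str]]:
    # Build the whole keyword -> doc_ids table in memory first...
    table: dict[str, set[str]] = {}
    for doc_id, text in chunks.items():
        for kw in extract_keywords(text):
            table.setdefault(kw, set()).add(doc_id)
    # ...so the slow persistence step can run once per batch
    # instead of once per chunk.
    return table

table = index_batch({
    "chunk-1": "redis lock makes keyword indexing sequential",
    "chunk-2": "keyword extraction with jieba is slow per chunk",
})
```

The key design point is that the expensive save runs once per batch, which also removes most of the contention the Redis lock was protecting against.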

Here's a conceptual example of how you might implement parallel processing by removing the Redis lock:

```python
from concurrent.futures import ThreadPoolExecutor

def process_text(text, keywords, keyword_table_handler, config, dataset_id):
    # Extract inside the worker (unless keywords were supplied by the
    # caller), so that extraction itself runs in parallel across chunks.
    if not keywords:
        keywords = keyword_table_handler.extract_keywords(
            text.page_content, config.max_keywords_per_chunk
        )
    if text.metadata is not None:
        _update_segment_keywords(dataset_id, text.metadata["doc_id"], list(keywords))
        return text.metadata["doc_id"], list(keywords)
    return None, None

def add_texts_parallel(self, texts: list[Document], **kwargs):
    keyword_table_handler = JiebaKeywordTableHandler()
    keyword_table = self._get_dataset_keyword_table()
    keywords_list = kwargs.get("keywords_list")

    with ThreadPoolExecutor() as executor:
        futures = []
        for i, text in enumerate(texts):
            # Reuse caller-supplied keywords when available.
            keywords = keywords_list[i] if keywords_list else None
            futures.append(
                executor.submit(
                    process_text, text, keywords,
                    keyword_table_handler, self._config, self.dataset.id
                )
            )

        # Merge results in the main thread so the shared keyword table
        # is never mutated concurrently.
        for future in futures:
            doc_id, keywords = future.result()
            if doc_id and keywords:
                keyword_table = self._add_text_to_keyword_table(
                    keyword_table or {}, doc_id, keywords
                )

    # One save for the whole batch instead of one per chunk.
    self._save_dataset_keyword_table(keyword_table)
```

This example uses a ThreadPoolExecutor to process texts in parallel, merges the results in a single thread, and saves the keyword table once for the whole batch. Note that this is a conceptual example and may need adjustments based on your specific environment and requirements [1].

To continue talking to Dosu, mention @dosu.


dosubot[bot] avatar Feb 01 '25 15:02 dosubot[bot]

Could I disable the keyword-extraction function in the Dify configuration to avoid this problem?

shawn-maxiao avatar Feb 08 '25 09:02 shawn-maxiao

Same problem here. How can this be resolved?

j0rGeT avatar Feb 26 '25 09:02 j0rGeT

> same problem, how to resolve?

Just switch LLMOps platforms, from Dify to RAGFlow or similar.

shawn-maxiao avatar Feb 27 '25 16:02 shawn-maxiao

same problem!!

imaben avatar Mar 11 '25 09:03 imaben

@JohnJyong This issue has persisted for a very long time and has a significant impact on the indexing efficiency of knowledge bases with a large number of documents. Please pay close attention to it. Thank you very much.

imaben avatar Mar 11 '25 09:03 imaben

@JohnJyong Urgently seeking a solution, or failing that, a temporary way to block this behavior. Orz!

imaben avatar Mar 13 '25 07:03 imaben

same problem

dongjinhai avatar Mar 26 '25 03:03 dongjinhai

Hi, @shawn-maxiao. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You reported slow performance when uploading over 30,000 texts due to sequential keyword extraction using Jieba and a Redis lock.
  • I suggested optimizations like removing the Redis lock for parallel processing and batch processing.
  • Other users, including j0rGeT, imaben, and dongjinhai, have experienced similar issues.
  • Imaben urged maintainer JohnJyong for a solution or workaround.
  • You suggested switching from the Dify platform to ragflow as a potential resolution.

Next Steps:

  • Please let us know if this issue is still relevant to the latest version of the Dify repository by commenting here.
  • If no updates are provided, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot[bot] avatar Apr 26 '25 16:04 dosubot[bot]