When uploading many texts (>30,000) to a knowledge base via the API, it takes far too long for them to become ready.
Self Checks
- [x] This is only for bug report; if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit the issue in English, or it will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
0.15.0
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
I read the source code to find out why, and I found that Dify spends most of the time extracting keywords.
I found that the extract-keywords function uses Jieba to extract keywords and then saves the keyword table. The save function is very slow (about 2~5 seconds to save the dataset keyword table for a single document chunk):

```python
self._save_dataset_keyword_table(keyword_table)
```
Another big problem is a Redis lock, which makes the other extract-keywords threads wait for the one that is currently executing, so the process runs one-by-one instead of in parallel:

```python
lock_name = "keyword_indexing_lock_{}".format(self.dataset.id)
with redis_client.lock(lock_name, timeout=600):
    ...
```
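To make the serialization effect concrete, here is a minimal, self-contained sketch (using a plain `threading.Lock` as a stand-in for the Redis lock, and a sleep in place of extraction plus the DB save) showing that one shared lock caps concurrency at a single worker, while removing it lets workers overlap:

```python
import threading
import time


def run_workers(n_workers, use_lock):
    """Run n_workers threads through a critical section and return the peak
    number of threads that were inside it at the same time."""
    shared_lock = threading.Lock()        # stands in for the Redis lock
    guard = threading.Lock()              # protects the counters below
    barrier = threading.Barrier(n_workers)
    state = {"current": 0, "peak": 0}

    def worker():
        barrier.wait()                    # start all workers together
        if use_lock:
            shared_lock.acquire()
        try:
            with guard:
                state["current"] += 1
                state["peak"] = max(state["peak"], state["current"])
            time.sleep(0.05)              # simulate extraction + DB save
            with guard:
                state["current"] -= 1
        finally:
            if use_lock:
                shared_lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


print(run_workers(4, use_lock=True))   # 1: the lock serializes the workers
print(run_workers(4, use_lock=False))  # usually 4: workers overlap freely
```

With the lock, total time grows linearly with the number of chunks; without it, it is bounded by the slowest worker.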
How to solve it?
✔️ Expected Behavior
No response
❌ Actual Behavior
No response
@shawn-maxiao Hello there! I'm here to assist you with any bugs, questions, or contributions. Let's tackle this issue together while we wait for a human maintainer.
To address the performance issues you're experiencing with the keyword extraction function in Dify, especially when dealing with a large number of texts, you can consider the following optimizations:
- **Remove the Redis lock for parallel processing:** The Redis lock forces sequential processing, which is the main bottleneck. Removing it lets each text be processed in a separate thread or process, which can significantly improve performance.
- **Batch processing:** Instead of processing each text individually, batch multiple texts together and process them in a single operation. This reduces the overhead of acquiring and releasing locks and can improve the efficiency of keyword extraction.
- **Optimize Jieba usage:** Consider tuning Jieba's parameters, or using a more efficient algorithm or library for keyword extraction.
- **Asynchronous processing:** Implement asynchronous processing for keyword extraction, which can utilize the CPU more efficiently and lead to better performance.
Here's a conceptual example of how you might implement parallel processing by removing the Redis lock:

```python
from concurrent.futures import ThreadPoolExecutor


def process_text(text, keyword_table_handler, config, dataset_id, keywords=None):
    # Extract keywords only if the caller did not supply them.
    if not keywords:
        keywords = keyword_table_handler.extract_keywords(
            text.page_content, config.max_keywords_per_chunk
        )
    if text.metadata is not None:
        _update_segment_keywords(dataset_id, text.metadata["doc_id"], list(keywords))
        return text.metadata["doc_id"], list(keywords)
    return None, None


def add_texts_parallel(self, texts: list[Document], **kwargs):
    keyword_table_handler = JiebaKeywordTableHandler()
    keyword_table = self._get_dataset_keyword_table()
    keywords_list = kwargs.get("keywords_list")

    with ThreadPoolExecutor() as executor:
        futures = []
        for i, text in enumerate(texts):
            # Use the pre-supplied keywords when available; otherwise let
            # the worker extract them in parallel.
            keywords = keywords_list[i] if keywords_list else None
            futures.append(
                executor.submit(
                    process_text, text, keyword_table_handler,
                    self._config, self.dataset.id, keywords,
                )
            )

        for future in futures:
            doc_id, keywords = future.result()
            if doc_id and keywords:
                keyword_table = self._add_text_to_keyword_table(
                    keyword_table or {}, doc_id, keywords
                )

    # Persist the merged table once, instead of once per chunk.
    self._save_dataset_keyword_table(keyword_table)
```
This example uses a ThreadPoolExecutor to process texts in parallel, which can improve performance by utilizing multiple CPU cores. Note that this is a conceptual example and may need adjustments based on your specific environment and requirements [1].
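The batch-processing idea (point 2 above) can also be sketched independently of Dify's internals: accumulate every chunk's keywords into a single in-memory inverted table, then persist it once at the end, rather than saving after each chunk. The `extract` callable and the `(chunk_id, text)` format here are illustrative stand-ins for the real Jieba handler and document chunks:

```python
def build_keyword_table(chunks, extract):
    """Build a keyword -> {chunk_id, ...} table in memory.

    chunks:  iterable of (chunk_id, text) pairs
    extract: callable returning an iterable of keywords for a text
    The caller persists the returned table once, instead of once per chunk.
    """
    table = {}
    for chunk_id, text in chunks:
        for kw in extract(text):
            table.setdefault(kw, set()).add(chunk_id)
    return table


# Toy usage with a trivial whitespace "extractor":
chunks = [("c1", "redis lock slow"), ("c2", "jieba keyword slow")]
table = build_keyword_table(chunks, extract=lambda t: t.split())
print(table["slow"])  # contains both 'c1' and 'c2'
```

This turns N database saves (one per chunk, at 2~5 seconds each) into a single save of the merged table.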
Could I disable the keyword-extraction function in the Dify configuration to avoid this problem?
same problem, how to resolve?
same problem, how to resolve? Just change the llmops platform from dify to ragflow/...
same problem!!
@JohnJyong This issue has persisted for a very long time and has a significant impact on the indexing efficiency of knowledge bases with a large number of documents. Please pay close attention to it. Thank you very much.
@JohnJyong Urgently seeking a solution, or failing that, a temporary way to disable it. Orz!
same problem
Hi, @shawn-maxiao. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.
Issue Summary:
- You reported slow performance when uploading over 30,000 texts due to sequential keyword extraction using Jieba and a Redis lock.
- I suggested optimizations like removing the Redis lock for parallel processing and batch processing.
- Other users, including j0rGeT, imaben, and dongjinhai, have experienced similar issues.
- Imaben urged maintainer JohnJyong for a solution or workaround.
- You suggested switching from the Dify platform to ragflow as a potential resolution.
Next Steps:
- Please let us know if this issue is still relevant to the latest version of the Dify repository by commenting here.
- If no updates are provided, this issue will be automatically closed in 15 days.
Thank you for your understanding and contribution!