Database migration and re-vectorization fail with errors after upgrading to v1.0.0
Self Checks
- [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
Version 1.0.0-beta.1
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
After upgrading to v1.0.0, I migrated the existing Docker volume directly so I could keep using my previous data. After starting up, I observed two problems:
- Model loading is unstable; for example, my Ollama embedding model keeps dropping offline.
- The previously uploaded files need to be re-vectorized, but they all get stuck at the indexing step. Judging from the background logs, no vectorization is actually happening either.
✔️ Expected Behavior
No response
❌ Actual Behavior
```
worker-1 | 2025-02-16 08:39:39,251.251 ERROR [Dummy-4] [indexing_runner.py:96] - consume document failed
2025-02-16 16:39:39 worker-1 | Traceback (most recent call last):
2025-02-16 16:39:39 worker-1 | File "/app/api/core/indexing_runner.py", line 73, in run
2025-02-16 16:39:39 worker-1 | documents = self._transform(
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/indexing_runner.py", line 696, in _transform
2025-02-16 16:39:39 worker-1 | documents = index_processor.transform(
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/index_processor/processor/parent_child_index_processor.py", line 71, in transform
2025-02-16 16:39:39 worker-1 | child_nodes = self._split_child_nodes(
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/index_processor/processor/parent_child_index_processor.py", line 184, in _split_child_nodes
2025-02-16 16:39:39 worker-1 | child_documents = child_splitter.split_documents([document_node])
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/text_splitter.py", line 96, in split_documents
2025-02-16 16:39:39 worker-1 | return self.create_documents(texts, metadatas=metadatas)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/text_splitter.py", line 81, in create_documents
2025-02-16 16:39:39 worker-1 | for chunk in self.split_text(text):
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/fixed_text_splitter.py", line 71, in split_text
2025-02-16 16:39:39 worker-1 | final_chunks.extend(self.recursive_split_text(chunk))
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/fixed_text_splitter.py", line 108, in recursive_split_text
2025-02-16 16:39:39 worker-1 | other_info = self.recursive_split_text(s)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/fixed_text_splitter.py", line 111, in recursive_split_text
2025-02-16 16:39:39 worker-1 | merged_text = self._merge_splits(_good_splits, separator, _good_splits_lengths)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/text_splitter.py", line 132, in _merge_splits
2025-02-16 16:39:39 worker-1 | total -= self._length_function([current_doc[0]])[0] + (
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/fixed_text_splitter.py", line 38, in _token_encoder
2025-02-16 16:39:39 worker-1 | return embedding_model_instance.get_text_embedding_num_tokens(texts=texts)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/model_manager.py", line 244, in get_text_embedding_num_tokens
2025-02-16 16:39:39 worker-1 | self._round_robin_invoke(
2025-02-16 16:39:39 worker-1 | File "/app/api/core/model_manager.py", line 370, in _round_robin_invoke
2025-02-16 16:39:39 worker-1 | return function(*args, **kwargs)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/model_runtime/model_providers/__base/text_embedding_model.py", line 65, in get_num_tokens
2025-02-16 16:39:39 worker-1 | return plugin_model_manager.get_text_embedding_num_tokens(
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/plugin/manager/model.py", line 313, in get_text_embedding_num_tokens
2025-02-16 16:39:39 worker-1 | for resp in response:
2025-02-16 16:39:39 worker-1 | ^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/plugin/manager/base.py", line 189, in _request_with_plugin_daemon_response_stream
2025-02-16 16:39:39 worker-1 | self._handle_plugin_daemon_error(error.error_type, error.message)
2025-02-16 16:39:39 worker-1 | File "/app/api/core/plugin/manager/base.py", line 221, in _handle_plugin_daemon_error
2025-02-16 16:39:39 worker-1 | raise PluginInvokeError(description=message)
2025-02-16 16:39:39 worker-1 | core.plugin.manager.exc.PluginInvokeError: PluginInvokeError: {"args":{},"error_type":"RuntimeError","message":"can't start new thread"}
2025-02-16 16:39:39 worker-1 | 2025-02-16 08:39:39,302.302 INFO [Dummy-4] [retry_document_indexing_task.py:57] - Start retry document: 096241cb-3810-438a-bb51-19b49052c56e
2025-02-16 16:39:39 plugin_daemon-1 | [GIN] 2025/02/16 - 08:39:39 | 200 | 1.494959ms | 172.19.0.9 | POST "/plugin/ed31cbda-7be7-42dc-af0a-6e4fe5ea9278/dispatch/text_embedding/num_tokens"
2025-02-16 16:39:39 worker-1 | 2025-02-16 08:39:39,380.380 ERROR [Dummy-4] [indexing_runner.py:96] - consume document failed
2025-02-16 16:39:39 worker-1 | Traceback (most recent call last):
2025-02-16 16:39:39 worker-1 | File "/app/api/core/indexing_runner.py", line 73, in run
2025-02-16 16:39:39 worker-1 | documents = self._transform(
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/indexing_runner.py", line 696, in _transform
2025-02-16 16:39:39 worker-1 | documents = index_processor.transform(
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/index_processor/processor/parent_child_index_processor.py", line 54, in transform
2025-02-16 16:39:39 worker-1 | document_nodes = splitter.split_documents([document])
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/text_splitter.py", line 96, in split_documents
2025-02-16 16:39:39 worker-1 | return self.create_documents(texts, metadatas=metadatas)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/text_splitter.py", line 81, in create_documents
2025-02-16 16:39:39 worker-1 | for chunk in self.split_text(text):
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/fixed_text_splitter.py", line 68, in split_text
2025-02-16 16:39:39 worker-1 | chunks_lengths = self._length_function(chunks)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/fixed_text_splitter.py", line 38, in _token_encoder
2025-02-16 16:39:39 worker-1 | return embedding_model_instance.get_text_embedding_num_tokens(texts=texts)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/model_manager.py", line 244, in get_text_embedding_num_tokens
2025-02-16 16:39:39 worker-1 | self._round_robin_invoke(
2025-02-16 16:39:39 worker-1 | File "/app/api/core/model_manager.py", line 370, in _round_robin_invoke
2025-02-16 16:39:39 worker-1 | return function(*args, **kwargs)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/model_runtime/model_providers/__base/text_embedding_model.py", line 65, in get_num_tokens
2025-02-16 16:39:39 worker-1 | return plugin_model_manager.get_text_embedding_num_tokens(
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/plugin/manager/model.py", line 313, in get_text_embedding_num_tokens
2025-02-16 16:39:39 worker-1 | for resp in response:
2025-02-16 16:39:39 worker-1 | ^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/plugin/manager/base.py", line 189, in _request_with_plugin_daemon_response_stream
2025-02-16 16:39:39 worker-1 | self._handle_plugin_daemon_error(error.error_type, error.message)
2025-02-16 16:39:39 worker-1 | File "/app/api/core/plugin/manager/base.py", line 221, in _handle_plugin_daemon_error
2025-02-16 16:39:39 worker-1 | raise PluginInvokeError(description=message)
2025-02-16 16:39:39 worker-1 | core.plugin.manager.exc.PluginInvokeError: PluginInvokeError: {"args":{},"error_type":"RuntimeError","message":"can't start new thread"}
2025-02-16 16:39:39 worker-1 | 2025-02-16 08:39:39,384.384 INFO [Dummy-4] [retry_document_indexing_task.py:57] - Start retry document: 4bee44cf-2e27-45d1-8c84-75ca3cc03b2c
2025-02-16 16:39:39 plugin_daemon-1 | [GIN] 2025/02/16 - 08:39:39 | 200 | 1.2985ms | 172.19.0.9 | POST "/plugin/ed31cbda-7be7-42dc-af0a-6e4fe5ea9278/dispatch/text_embedding/num_tokens"
2025-02-16 16:39:39 worker-1 | 2025-02-16 08:39:39,423.423 ERROR [Dummy-4] [indexing_runner.py:96] - consume document failed
2025-02-16 16:39:39 worker-1 | Traceback (most recent call last):
2025-02-16 16:39:39 worker-1 | File "/app/api/core/indexing_runner.py", line 73, in run
2025-02-16 16:39:39 worker-1 | documents = self._transform(
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/indexing_runner.py", line 696, in _transform
2025-02-16 16:39:39 worker-1 | documents = index_processor.transform(
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/index_processor/processor/parent_child_index_processor.py", line 54, in transform
2025-02-16 16:39:39 worker-1 | document_nodes = splitter.split_documents([document])
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/text_splitter.py", line 96, in split_documents
2025-02-16 16:39:39 worker-1 | return self.create_documents(texts, metadatas=metadatas)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/text_splitter.py", line 81, in create_documents
2025-02-16 16:39:39 worker-1 | for chunk in self.split_text(text):
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/fixed_text_splitter.py", line 68, in split_text
2025-02-16 16:39:39 worker-1 | chunks_lengths = self._length_function(chunks)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/rag/splitter/fixed_text_splitter.py", line 38, in _token_encoder
2025-02-16 16:39:39 worker-1 | return embedding_model_instance.get_text_embedding_num_tokens(texts=texts)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/model_manager.py", line 244, in get_text_embedding_num_tokens
2025-02-16 16:39:39 worker-1 | self._round_robin_invoke(
2025-02-16 16:39:39 worker-1 | File "/app/api/core/model_manager.py", line 370, in _round_robin_invoke
2025-02-16 16:39:39 worker-1 | return function(*args, **kwargs)
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/model_runtime/model_providers/__base/text_embedding_model.py", line 65, in get_num_tokens
2025-02-16 16:39:39 worker-1 | return plugin_model_manager.get_text_embedding_num_tokens(
2025-02-16 16:39:39 worker-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/plugin/manager/model.py", line 313, in get_text_embedding_num_tokens
2025-02-16 16:39:39 worker-1 | for resp in response:
2025-02-16 16:39:39 worker-1 | ^^^^^^^^
2025-02-16 16:39:39 worker-1 | File "/app/api/core/plugin/manager/base.py", line 189, in _request_with_plugin_daemon_response_stream
2025-02-16 16:39:39 worker-1 | self._handle_plugin_daemon_error(error.error_type, error.message)
2025-02-16 16:39:39 worker-1 | File "/app/api/core/plugin/manager/base.py", line 221, in _handle_plugin_daemon_error
2025-02-16 16:39:39 worker-1 | raise PluginInvokeError(description=message)
2025-02-16 16:39:39 worker-1 | core.plugin.manager.exc.PluginInvokeError: PluginInvokeError: {"args":{},"error_type":"RuntimeError","message":"can't start new thread"}
2025-02-16 16:39:39 worker-1 | 2025-02-16 08:39:39,425.425 INFO [Dummy-4] [retry_document_indexing_task.py:96] - Retry dataset: 028d42c7-3685-4840-9648-cea862a77301 latency: 62.43543811200652
2025-02-16 16:39:39 worker-1 | 2025-02-16 08:39:39,479.479 INFO [Dummy-4] [trace.py:128] - Task tasks.retry_document_indexing_task.retry_document_indexing_task[48929b9c-9689-4e64-9c92-e9c831341486] succeeded
```
The error you're encountering, "can't start new thread," is a RuntimeError that occurs when the system is unable to allocate a new thread. This can happen due to system resource limits, such as reaching the maximum number of threads allowed by the operating system or insufficient memory.
In the context of the Dify platform, this error is raised in plugin/manager/base.py when handling plugin daemon errors. The PluginInvokeError is triggered when the plugin daemon attempts to start a new thread but fails due to these constraints [1].
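To make the failure mode concrete, here is a minimal, hedged sketch (illustrative only, not Dify code) showing that `threading.Thread.start()` raises exactly this RuntimeError once the process exhausts its thread allowance, whether the cap comes from `ulimit -u`, a cgroup pids limit, or memory pressure:

```python
# Illustrative only: deliberately exhaust the thread limit to trigger the
# same RuntimeError. Run in a throwaway container/VM, not on a shared host.
import threading
import time

def hold():
    time.sleep(60)  # keep each thread alive so they accumulate

threads = []
try:
    while True:
        # daemon=True so the process can exit cleanly after the failure
        t = threading.Thread(target=hold, daemon=True)
        t.start()  # raises RuntimeError("can't start new thread") at the limit
        threads.append(t)
except RuntimeError as exc:
    print(f"Started {len(threads)} threads before failing: {exc}")
```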
To address this issue, consider the following steps:
- **Check System Resources:** Ensure that your system has enough resources (CPU, memory) to handle the workload. You might need to increase the available resources or optimize current usage.
- **Review Thread Limits:** Check the system's thread limits and adjust them if necessary. This might involve changing system configurations to allow more threads; the diagnostic sketch after this list shows how to inspect the relevant limits.
- **Optimize Code:** Review the code to ensure that threads are being managed efficiently. Avoid creating unnecessary threads and ensure that threads are properly terminated when no longer needed.
- **Docker Configuration:** If you're running this in a Docker environment, ensure that the container has sufficient resources allocated and that Docker's resource limits are not being exceeded (an example compose override follows below).
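For the first two checks, here is a hedged diagnostic sketch (Linux-only; the cgroup path assumes cgroup v2, which newer Docker hosts use) that could be run inside the worker container:

```python
# Inspect the limits that commonly cause "can't start new thread" in a container.
import os
import resource

# Per-user process/thread cap (the `ulimit -u` value).
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC (soft/hard):", soft, hard)

# System-wide thread cap.
with open("/proc/sys/kernel/threads-max") as f:
    print("kernel.threads-max:", f.read().strip())

# Docker's pids_limit maps to the cgroup pids controller (cgroup v2 path shown).
try:
    with open("/sys/fs/cgroup/pids.max") as f:
        print("cgroup pids.max:", f.read().strip())
except FileNotFoundError:
    print("cgroup pids.max: not found (host may use cgroup v1)")

# Threads currently used by this process (one /proc task entry per thread).
print("threads in this process:", len(os.listdir(f"/proc/{os.getpid()}/task")))
```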
These steps should help mitigate the issue and allow the indexing process to proceed without encountering the thread limitation error.
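For the Docker point specifically, a low container PID/thread cap is a common culprit. A hedged example of relaxing it via a `docker-compose.override.yml` (values are illustrative, and the service name must match the worker service in your Dify compose file):

```yaml
# docker-compose.override.yml (illustrative values; adjust to your deployment)
services:
  worker:
    pids_limit: 4096   # raise the cgroup pids cap for the worker container
    ulimits:
      nproc: 65535     # per-user process/thread limit inside the container
      nofile:
        soft: 65535
        hard: 65535
```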
Why is token counting in the OpenLLMLargeLanguageModel hard-coded to use the _get_num_tokens_by_gpt2 method? I deployed a local Qwen model and connected it to my self-hosted Dify. For Chinese conversations, _get_num_tokens_by_gpt2 significantly overestimates the token count, so the allowed input length ends up far too small.
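A quick way to see the discrepancy (a hedged sketch; assumes the `transformers` library is installed and the tokenizers can be downloaded, and the Qwen model ID below is illustrative):

```python
# Compare GPT-2's byte-level BPE (no Chinese-specific merges, so it emits
# several tokens per Chinese character) against a Qwen tokenizer trained
# on Chinese text, which is far more compact.
from transformers import AutoTokenizer

text = "为什么模型计算 tokens 数量写死了使用 GPT-2 的方法?"

gpt2 = AutoTokenizer.from_pretrained("gpt2")
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")  # illustrative model ID

print("gpt2 tokens:", len(gpt2.encode(text)))
print("qwen tokens:", len(qwen.encode(text)))
```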
Before updating to version 1.0.0, uploading 500 files under the same hardware and software environment did not cause this issue. It appears that version 1.0.0 has significant faults in how it queues and recycles new threads. Please resolve this as soon as possible.
Any updates? Same error.