App's embedding model cannot be synchronized/updated with the embedding model used in the dataset
Self Checks
- [X] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
- [X] Please do not modify this template :) and fill in all the required fields.
Dify version
0.8.0
Cloud or Self Hosted
Self Hosted (Source)
Steps to reproduce
For example, my app was using bge-m3 as its embedding model,
and the only selected dataset uses the Conan embedding model.
An error occurred when I tried to chat.
The Update button does not solve the issue either.
✔️ Expected Behavior
App's embedding model can be synchronized/updated with the embedding model used in the dataset
❌ Actual Behavior
App's embedding model cannot be synchronized/updated with the embedding model used in the dataset
I found a similar open issue: Enable embedding model option when creating knowledge, which was last updated on September 03, 2024. Additionally, there are two similar closed issues: Wrong embedding model displayed in datasets and 如何增加其他的embedding model呢? ("How do I add other embedding models?") [1][2][3].
To synchronize or update the app's embedding model with the embedding model used in the dataset in Dify version 0.8.0, you can use the update_dataset method in the DatasetService class. This method handles updating the dataset's embedding model and ensures that the new model settings are applied correctly.
Here is the relevant code snippet from the DatasetService class:
    @staticmethod
    def update_dataset(dataset_id, data, user):
        data.pop("partial_member_list", None)
        filtered_data = {k: v for k, v in data.items() if v is not None or k == "description"}
        dataset = DatasetService.get_dataset(dataset_id)
        DatasetService.check_dataset_permission(dataset, user)
        action = None
        if dataset.indexing_technique != data["indexing_technique"]:
            # if update indexing_technique
            if data["indexing_technique"] == "economy":
                action = "remove"
                filtered_data["embedding_model"] = None
                filtered_data["embedding_model_provider"] = None
                filtered_data["collection_binding_id"] = None
            elif data["indexing_technique"] == "high_quality":
                action = "add"
                # get embedding model setting
                try:
                    model_manager = ModelManager()
                    embedding_model = model_manager.get_model_instance(
                        tenant_id=current_user.current_tenant_id,
                        provider=data["embedding_model_provider"],
                        model_type=ModelType.TEXT_EMBEDDING,
                        model=data["embedding_model"],
                    )
                    filtered_data["embedding_model"] = embedding_model.model
                    filtered_data["embedding_model_provider"] = embedding_model.provider
                    dataset_collection_binding = DatasetCollectionBindingService.get_dataset_collection_binding(
                        embedding_model.provider, embedding_model.model
                    )
                    filtered_data["collection_binding_id"] = dataset_collection_binding.id
                except LLMBadRequestError:
                    raise ValueError(
                        "No Embedding Model available. Please configure a valid provider "
                        "in the Settings -> Model Provider."
                    )
        else:
            if (
                data["embedding_model_provider"] != dataset.embedding_model_provider
                or data["embedding_model"] != dataset.embedding_model
            ):
                action = "update"
                try:
                    model_manager = ModelManager()
                    embedding_model = model_manager.get_model_instance(
                        tenant_id=current_user.current_tenant_id,
                        provider=data["embedding_model_provider"],
                        model_type=ModelType.TEXT_EMBEDDING,
                        model=data["embedding_model"],
                    )
                    filtered_data["embedding_model"] = embedding_model.model
                    filtered_data["embedding_model_provider"] = embedding_model.provider
                    dataset_collection_binding = DatasetCollectionBindingService.get_dataset_collection_binding(
                        embedding_model.provider, embedding_model.model
                    )
                    filtered_data["collection_binding_id"] = dataset_collection_binding.id
                except LLMBadRequestError:
                    raise ValueError(
                        "No Embedding Model available. Please configure a valid provider "
                        "in the Settings -> Model Provider."
                    )
                except ProviderTokenNotInitError as ex:
                    raise ValueError(ex.description)
        filtered_data["updated_by"] = user.id
        filtered_data["updated_at"] = datetime.datetime.now()
        # update Retrieval model
        filtered_data["retrieval_model"] = data["retrieval_model"]
        dataset.query.filter_by(id=dataset_id).update(filtered_data)
        db.session.commit()
        if action:
            deal_dataset_vector_index_task.delay(dataset_id, action)
        return dataset
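To make the branching in the snippet easier to follow, here is a minimal standalone sketch of how the `action` value appears to be chosen. It is not part of Dify: `resolve_index_action` is a hypothetical helper written for illustration, the provider name `example_provider` is a placeholder, and the database, `ModelManager`, and task queue are deliberately left out.

```python
from typing import Optional


def resolve_index_action(
    current_technique: str,
    current_provider: Optional[str],
    current_model: Optional[str],
    data: dict,
) -> Optional[str]:
    """Hypothetical helper: mirrors only the action-selection branches
    of the quoted update_dataset snippet, with no side effects."""
    if current_technique != data["indexing_technique"]:
        # The indexing technique itself is changing.
        if data["indexing_technique"] == "economy":
            return "remove"  # embeddings are no longer needed
        if data["indexing_technique"] == "high_quality":
            return "add"  # embeddings must be built
        return None
    # Same technique: re-index only when the provider or the model differs.
    if (
        data["embedding_model_provider"] != current_provider
        or data["embedding_model"] != current_model
    ):
        return "update"
    return None


# A dataset indexed with one model, and a request that switches it to another.
request = {
    "indexing_technique": "high_quality",
    "embedding_model_provider": "example_provider",  # placeholder name
    "embedding_model": "bge-m3",
}
action = resolve_index_action("high_quality", "example_provider", "Conan", request)
print(action)
```

Under these assumptions, a request that keeps `high_quality` but names a different embedding model resolves to `"update"`, which in the quoted snippet is the case where `deal_dataset_vector_index_task` is queued to rebuild the vector index.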
Additionally, if you are using Replicate as your model provider, you need to follow these steps to configure the embedding model:
- Ensure you have a Replicate account and obtain an API Key.
- Pick a model from Replicate's Embedding models.
- Add the model in Dify's Settings > Model Provider > Replicate by providing the API Key, Model Name, and Model Version.
Dify supports embedding models from various providers, including Azure, LocalAI, MiniMax, OpenAI, Replicate, and XInference. You can select the Embeddings type in the configuration box to set up your preferred embedding model [4][5][6].