App's embedding model cannot be synchronized/updated with the embedding model used in the dataset

Open soulzzz opened this issue 1 year ago • 1 comments

Self Checks

[X] This is only for bug report, if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.8.0

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

For example, my app was using bge-m3 as its embedding model,

and the only selected dataset was using the Conan embedding model.

An error occurred when I tried to chat.

The Update button doesn't resolve the issue either.

✔️ Expected Behavior

App's embedding model can be synchronized/updated with the embedding model used in the dataset

❌ Actual Behavior

App's embedding model cannot be synchronized/updated with the embedding model used in the dataset

Sep 11 '24 02:09 soulzzz

I found a similar open issue: "Enable embedding model option when creating knowledge", which was last updated on September 03, 2024. Additionally, there are two similar closed issues: "Wrong embedding model displayed in datasets" and "How do I add other embedding models?" ^[1]^[2]^[3].

To synchronize or update the app's embedding model with the embedding model used in the dataset in Dify version 0.8.0, you can use the update_dataset method in the DatasetService class. This method handles updating the dataset's embedding model and ensures that the new model settings are applied correctly.

Here is the relevant code snippet from the DatasetService class:

@staticmethod
def update_dataset(dataset_id, data, user):
    data.pop("partial_member_list", None)
    filtered_data = {k: v for k, v in data.items() if v is not None or k == "description"}
    dataset = DatasetService.get_dataset(dataset_id)
    DatasetService.check_dataset_permission(dataset, user)
    action = None
    if dataset.indexing_technique != data["indexing_technique"]:
        # if update indexing_technique
        if data["indexing_technique"] == "economy":
            action = "remove"
            filtered_data["embedding_model"] = None
            filtered_data["embedding_model_provider"] = None
            filtered_data["collection_binding_id"] = None
        elif data["indexing_technique"] == "high_quality":
            action = "add"
            # get embedding model setting
            try:
                model_manager = ModelManager()
                embedding_model = model_manager.get_model_instance(
                    tenant_id=current_user.current_tenant_id,
                    provider=data["embedding_model_provider"],
                    model_type=ModelType.TEXT_EMBEDDING,
                    model=data["embedding_model"],
                )
                filtered_data["embedding_model"] = embedding_model.model
                filtered_data["embedding_model_provider"] = embedding_model.provider
                dataset_collection_binding = DatasetCollectionBindingService.get_dataset_collection_binding(
                    embedding_model.provider, embedding_model.model
                )
                filtered_data["collection_binding_id"] = dataset_collection_binding.id
            except LLMBadRequestError:
                raise ValueError(
                    "No Embedding Model available. Please configure a valid provider "
                    "in the Settings -> Model Provider."
                )
    else:
        if (
            data["embedding_model_provider"] != dataset.embedding_model_provider
            or data["embedding_model"] != dataset.embedding_model
        ):
            action = "update"
            try:
                model_manager = ModelManager()
                embedding_model = model_manager.get_model_instance(
                    tenant_id=current_user.current_tenant_id,
                    provider=data["embedding_model_provider"],
                    model_type=ModelType.TEXT_EMBEDDING,
                    model=data["embedding_model"],
                )
                filtered_data["embedding_model"] = embedding_model.model
                filtered_data["embedding_model_provider"] = embedding_model.provider
                dataset_collection_binding = DatasetCollectionBindingService.get_dataset_collection_binding(
                    embedding_model.provider, embedding_model.model
                )
                filtered_data["collection_binding_id"] = dataset_collection_binding.id
            except LLMBadRequestError:
                raise ValueError(
                    "No Embedding Model available. Please configure a valid provider "
                    "in the Settings -> Model Provider."
                )
            except ProviderTokenNotInitError as ex:
                raise ValueError(ex.description)
    filtered_data["updated_by"] = user.id
    filtered_data["updated_at"] = datetime.datetime.now()

    # update Retrieval model
    filtered_data["retrieval_model"] = data["retrieval_model"]

    dataset.query.filter_by(id=dataset_id).update(filtered_data)

    db.session.commit()
    if action:
        deal_dataset_vector_index_task.delay(dataset_id, action)
    return dataset
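
The branching in the snippet above can be easier to follow in isolation. Below is a minimal sketch, not part of Dify: the function name decide_index_action and its parameters are hypothetical, and it only mirrors how the quoted code picks the action ("remove", "add", "update", or none) that is later passed to deal_dataset_vector_index_task. It does not touch the database or any model provider.

```python
from typing import Optional


def decide_index_action(
    current_technique: str,
    current_provider: Optional[str],
    current_model: Optional[str],
    new_technique: str,
    new_provider: Optional[str],
    new_model: Optional[str],
) -> Optional[str]:
    """Hypothetical helper mirroring the action selection in update_dataset."""
    if current_technique != new_technique:
        # The indexing technique changed.
        if new_technique == "economy":
            return "remove"  # embeddings are dropped
        if new_technique == "high_quality":
            return "add"  # embeddings are built with the new model
        return None
    # Same technique: re-index only if the provider or model changed.
    if new_provider != current_provider or new_model != current_model:
        return "update"
    return None


# Example using the model names from the report; "local" is a placeholder provider.
print(decide_index_action("high_quality", "local", "Conan",
                          "high_quality", "local", "bge-m3"))  # prints "update"
```

Under these assumptions, switching a high_quality dataset from Conan to bge-m3 yields "update", which in the quoted code is what triggers re-indexing of the dataset's vectors.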

Additionally, if you are using Replicate as your model provider, you need to follow these steps to configure the embedding model:

1. Ensure you have a Replicate account and obtain an API Key.
2. Pick a model from Replicate's Embedding models.
3. Add the model in Dify's Settings > Model Provider > Replicate by providing the API Key, Model Name, and Model Version.

Dify supports embedding models from various providers, including Azure, LocalAI, MiniMax, OpenAI, Replicate, and XInference. You can select the Embeddings type in the configuration box to set up your preferred embedding model ^[4]^[5]^[6].


Sep 11 '24 02:09 dosubot[bot]