It is necessary to upgrade the weaviate client.

Open jiandanfeng opened this issue 1 year ago • 5 comments

Self Checks

  • [X] I have searched for existing issues, including closed ones.
  • [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [X] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • [X] Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

The Weaviate client needs to be upgraded. The Weaviate version currently in use does not support Chinese word segmentation, so Chinese full-text search gives poor results. Weaviate should be upgraded to version 1.24, whose gse and trigram tokenization methods support Chinese word segmentation and would improve Chinese full-text search.

Weaviate tokenization: https://weaviate.io/developers/weaviate/config-refs/schema#tokenization
Weaviate client update: https://weaviate.io/developers/weaviate/client-libraries/python/v3_v4_migration#installation
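To illustrate why the tokenization method matters for Chinese text, here is a rough Python sketch. It is only an approximation of the idea, not Weaviate's actual implementation: `word_tokens` and `trigram_tokens` are hypothetical helper names, and the exact normalization rules Weaviate applies are assumed rather than confirmed.

```python
def word_tokens(text: str) -> list[str]:
    """Approximate a word-style split: lowercase, break on non-alphanumerics.

    Assumption: this mimics a whitespace/punctuation-based tokenizer.
    Chinese text has no spaces, so a whole phrase ends up as one token.
    """
    tokens, current = [], []
    for ch in text.lower():
        if ch.isalnum():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def trigram_tokens(text: str) -> list[str]:
    """Approximate rolling trigrams over the alphanumeric characters."""
    chars = [ch for ch in text.lower() if ch.isalnum()]
    if len(chars) < 3:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + 3]) for i in range(len(chars) - 2)]


document = "向量数据库"   # "vector database"
query = "数据库"          # "database"

# With a word-style split the query token never equals the document token.
word_hit = bool(set(word_tokens(query)) & set(word_tokens(document)))
# With trigrams the query shares a token with the document.
trigram_hit = bool(set(trigram_tokens(query)) & set(trigram_tokens(document)))
```

In this sketch the word-style split treats the whole document phrase as a single token, so the shorter query cannot match it, while the trigram split produces an overlapping token. A dictionary-based segmenter such as gse aims for the same effect with linguistically meaningful words instead of fixed-length windows.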

2. Additional context or comments

No response

3. Can you help us with this feature?

  • [ ] I am interested in contributing to this feature.

jiandanfeng Sep 06 '24 01:09

I found a similar discussion regarding the Weaviate database version and its support in Dify. You might find it relevant: Why doesn't Weaviate database use the latest version, still using version 1.19, and does Dify currently support higher versions of Weaviate? [1].

Currently, the Dify codebase uses Weaviate client version 1.19.0 [2]. The Weaviate client is integrated in the WeaviateVector class and utilized in integration tests [3][4].

For Chinese full-text search, the current implementation includes configurations for a Chinese lexer and uses Jieba for text segmentation [5]. The jieba library is used for word segmentation [6].

Upgrading to Weaviate version 1.24 to support Chinese word segmentation with gse or trigram methods would indeed improve the full-text search capabilities. If you are interested in contributing to this feature, you might want to start by updating the Weaviate client version in the docker-compose files and adjusting the relevant code to utilize the new segmentation methods.

To continue talking to Dosu, mention @dosu.

dosubot[bot] Sep 06 '24 01:09

I agree that we should upgrade the Weaviate version. Version 1.2x supports different compression methods and gRPC calls, which can significantly improve performance.

Kevin9703 Sep 09 '24 15:09

I agree with this suggestion. With the current (default) tokenizer settings, full-text search for Japanese barely works. I tested the following steps and confirmed that the performance of full-text search in Japanese improved significantly.

  1. Set the weaviate version to 1.24.25
  2. Add USE_GSE: true to the weaviate container's environment variables
  3. Edit dify/api/core/rag/datasource/vdb/weaviate/weaviate_vector.py as follows
    def _default_schema(self, index_name: str) -> dict:
        return {
            "class": index_name,
            "properties": [
                {
                    "name": "text",
                    "dataType": ["text"],
                    "tokenization": "gse", # <- added this
                }
            ],
        }

When implementing this, we would also need a UI that switches the tokenizer depending on whether the document being registered is in a language suited to GSE.
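As a minimal sketch of how that per-document choice could work without manual selection, the snippet below guesses a tokenizer from the share of CJK characters in a text sample and passes it into a schema shaped like the one above. Everything here is an assumption for illustration: `is_cjk`, `cjk_ratio`, `choose_tokenization`, the 0.3 threshold, and the `tokenization` parameter on `default_schema` are hypothetical and not part of the Dify codebase.

```python
def is_cjk(ch: str) -> bool:
    """Return True for common Chinese/Japanese script characters."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
    )


def cjk_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are CJK (0.0 if none)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if is_cjk(ch)) / len(letters)


def choose_tokenization(sample: str, threshold: float = 0.3) -> str:
    """Pick "gse" for mostly-CJK text, otherwise the "word" tokenizer.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return "gse" if cjk_ratio(sample) >= threshold else "word"


def default_schema(index_name: str, tokenization: str = "word") -> dict:
    """Schema in the same shape as above, with a configurable tokenizer."""
    return {
        "class": index_name,
        "properties": [
            {
                "name": "text",
                "dataType": ["text"],
                "tokenization": tokenization,
            }
        ],
    }


schema = default_schema("Example_index", choose_tokenization("全文検索のテスト"))
```

A heuristic like this could serve as a default that the UI lets the user override, since mixed-language documents will not always be classified the way the user wants.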

fujita-h Sep 29 '24 16:09

Named vectors are also needed: https://weaviate.io/developers/weaviate/search/similarity#named-vectors

Scallions Oct 25 '24 07:10

Hi, @jiandanfeng. I'm Dosu, and I'm helping the Dify team manage their backlog. I'm marking this issue as stale.

Issue Summary

  • The issue was about upgrading the Weaviate client to version 1.24 for better Chinese full-text search.
  • You and other contributors, including @Kevin9703 and @fujita-h, agreed on the benefits of the upgrade.
  • @fujita-h shared successful test results and suggested UI adjustments for language-specific tokenization.
  • The issue seems resolved with the community agreeing on the improvements.

Next Steps

  • Please confirm if this issue is still relevant to the latest version of the Dify repository. If so, you can keep the discussion open by commenting here.
  • If there are no further updates, this issue will be automatically closed in 15 days.

Thank you for your understanding and contribution!

dosubot[bot] Nov 25 '24 16:11

I'm focused on #12223.

fujita-h Dec 31 '24 16:12