feat: support tencent vector db

Open quicksandznzn opened this issue 10 months ago • 15 comments

Description

Support Tencent Vector DB

dependencies

tcvectordb==1.3.2

Type of Change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update, included: Dify Document
  • [ ] Improvement, including but not limited to code refactoring, performance optimization, and UI/UX improvement
  • [ ] Dependency upgrade

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • [ ] TODO

Suggested Checklist:

  • [ ] I have performed a self-review of my own code
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] My changes generate no new warnings
  • [ ] I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods
  • [ ] optional I have made corresponding changes to the documentation
  • [ ] optional I have added tests that prove my fix is effective or that my feature works
  • [ ] optional New and existing unit tests pass locally with my changes

quicksandznzn avatar Apr 17 '24 11:04 quicksandznzn

-1 for supporting tencent vdb

  1. tencent vdb, or Tencent Cloud VectorDB (https://cloud.tencent.com/product/vdb), is not an open-source vector DB, which makes it hard to test against a target vdb instance. The code is likely to end up unmaintained.
  2. The required tcvectordb Python package is missing from requirements.txt.
  3. The tcvectordb package provides no usage or requirements information on the public PyPI repo; see https://pypi.org/project/tcvectordb/
  4. Never put a .env file in the PR.

bowenliang123 avatar Apr 17 '24 13:04 bowenliang123

how is it going?

wade30822 avatar Apr 25 '24 07:04 wade30822

Waiting for review ~

quicksandznzn avatar Apr 25 '24 07:04 quicksandznzn

  1. Please resolve the style violations in the Python code by running dev/reformat.
  2. Move the tests to api/tests/integration_tests/vdb/tcvectordb

bowenliang123 avatar Apr 25 '24 08:04 bowenliang123

done~

quicksandznzn avatar Apr 25 '24 08:04 quicksandznzn

ok, thx.

bowenliang123 avatar Apr 25 '24 08:04 bowenliang123

@quicksandznzn The search_by_vector() method returns results without score_threshold filtering, and tcvectordb doesn't return the score, so score_threshold won't work as expected. This matters when, for example, you want the agent to ask for more information instead of giving an irrelevant reply. It seems to be a problem with the SDK.

zeroameli avatar Apr 25 '24 10:04 zeroameli

The SDK is fine; it does return the score.

JohnJyong avatar Apr 25 '24 10:04 JohnJyong

        # Inside search_by_vector(): a missing or None threshold falls back to 0.0.
        score_threshold = kwargs.get("score_threshold") or 0.0
        return self._get_search_res(res, score_threshold)

    def _get_search_res(self, res, score_threshold):
        docs = []
        if res is None or len(res) == 0:
            return docs

        # Only the first result list is read (one list per query vector is assumed).
        for result in res[0]:
            meta = result.get(self.field_metadata)
            # Fall back to an empty dict so meta['score'] below cannot fail on None.
            meta = json.loads(meta) if meta is not None else {}
            score = 1 - result.get("score")
            if score > score_threshold:
                meta['score'] = score
                doc = Document(page_content=result.get(self.field_text), metadata=meta)
                docs.append(doc)
        return docs
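
For anyone who wants to try the filtering logic without a Tencent VectorDB instance, here is a minimal standalone sketch. It is not the PR's code: the Document dataclass, the filter_by_score name, the "text" and "metadata" keys, and the sample hits are all assumptions made for illustration. It keeps the same 1 - score conversion and the same strict comparison against the threshold.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical stand-in for Dify's Document model; only two fields are used.
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_score(hits, score_threshold=0.0, text_key="text", meta_key="metadata"):
    """Keep only hits whose converted score is strictly above the threshold."""
    docs = []
    for hit in hits:
        raw_meta = hit.get(meta_key)
        # Metadata is assumed to be stored as a JSON string; missing metadata becomes {}.
        meta = json.loads(raw_meta) if raw_meta else {}
        # Same conversion as in the snippet above: 1 - raw score.
        score = 1 - hit.get("score", 0.0)
        if score > score_threshold:
            meta["score"] = score
            docs.append(Document(page_content=hit.get(text_key), metadata=meta))
    return docs


# Made-up hits for illustration only.
hits = [
    {"text": "close match", "metadata": '{"doc_id": "a"}', "score": 0.1},
    {"text": "weak match", "metadata": None, "score": 0.8},
]
docs = filter_by_score(hits, score_threshold=0.5)
```

With a threshold of 0.5 only the first hit survives, which is the behavior the agent needs in order to fall back to asking for more information when nothing relevant is found.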

JohnJyong avatar Apr 25 '24 11:04 JohnJyong

Thanks, optimized.

quicksandznzn avatar Apr 26 '24 01:04 quicksandznzn

Optimized, following your suggestions.

quicksandznzn avatar Apr 26 '24 02:04 quicksandznzn

@quicksandznzn Hello, please add me on WeChat (crazyphage) and I will invite you to our contributors' group.

crazywoola avatar Apr 26 '24 12:04 crazywoola

yep

quicksandznzn avatar Apr 28 '24 01:04 quicksandznzn

This branch has conflicts that must be resolved. Please fix them, thanks @quicksandznzn

JohnJyong avatar Apr 29 '24 07:04 JohnJyong

@quicksandznzn I found some problems: https://github.com/langgenius/dify/blob/a591366f727da7d6ae3612aaec678fe081e84e11/api/core/rag/datasource/vdb/tencent/tencent_vector.py#L158-L161

  • The limit parameter is required when querying with a filter. Why not call delete with a filter instead?
  • Metadata fields must be indexed if we need to filter by them.

self.collection won't be initialized in a multithreaded run because of the Redis lock (for example, when creating a dataset). Why not use self._db.collection(self._collection_name) instead?
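
To make the second point concrete, here is a rough sketch of the pattern I mean. FakeDatabase and FakeCollection are hypothetical stand-ins, not the tcvectordb SDK, and the filter string syntax is made up for illustration. The only point is that the collection handle is looked up on every call rather than cached in an init step that a thread may skip when another thread already holds the lock.

```python
class FakeCollection:
    """Hypothetical stand-in for an SDK collection handle (not tcvectordb)."""

    def __init__(self, name):
        self.name = name
        self.deleted_filters = []

    def delete(self, filter_expr):
        # Record the filter instead of talking to a real service.
        self.deleted_filters.append(filter_expr)


class FakeDatabase:
    """Hypothetical stand-in for an SDK database object (not tcvectordb)."""

    def __init__(self):
        self._collections = {}

    def collection(self, name):
        # Return the existing handle for this name, creating it on first use.
        return self._collections.setdefault(name, FakeCollection(name))


class VectorStore:
    def __init__(self, db, collection_name):
        self._db = db
        self._collection_name = collection_name

    def _collection(self):
        # Resolved on demand, so it does not depend on an init step
        # that may have been skipped in this thread.
        return self._db.collection(self._collection_name)

    def delete_by_metadata_field(self, key, value):
        # Delete directly with a filter rather than querying first.
        # The filter string format here is an assumption for illustration.
        self._collection().delete(f'{key}="{value}"')


db = FakeDatabase()
store = VectorStore(db, "docs")
store.delete_by_metadata_field("doc_id", "a")
```

Because nothing is cached on the instance, a worker that never ran the collection-creation path can still delete by filter.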

zeroameli avatar May 26 '24 12:05 zeroameli

it takes so long~~ 😭

wade30822 avatar Jun 07 '24 09:06 wade30822