Add Text2VecEmbeddingFunction

Open rxy1212 opened this issue 2 years ago • 3 comments

Add Text2VecEmbeddingFunction for better Chinese sentence embedding.

Description of changes

Summarize the changes made by this PR.

  • New functionality
    • Adds a new embedding function that uses the text2vec model for Chinese sentence embedding.

Test plan

How are these changes tested?

import chromadb
from chromadb.utils import embedding_functions

chroma_client = chromadb.Client()
text2vec_ef = embedding_functions.Text2VecEmbeddingFunction()
collection = chroma_client.create_collection(name="test", embedding_function=text2vec_ef)

collection.add(
    documents=[
        "小明被窗外枝头上小鸟的叫声吵醒了,迷迷糊糊的睁开了眼睛。",
        "小明从床上爬起来,走向洗手间开始洗漱。",
        "小明从厨房冰箱里拿了盒牛奶准备泡燕麦当早餐,但他发现燕麦已经没有了。",
        "小明又从冰箱里拿了两片面包作为早餐。",
        "小明准备出门去校车停靠点,在路上遇到了也要去上学的小王和小美。",
        "小明在校车上跟小伙伴们聊到了昨天晚上他在《迷你世界》里玩的很开心。",
        "小明到达学校,跟班上的同学们一起做了早操。",
        "第一节课是语文课,今天学习古诗《枫桥夜泊》。",
        "第二节课是数学课,老师讲了关于二元一次方程的解法,小明感觉自己没怎么搞清楚。",
        "课间休息的时候小明看到有人在玩飞盘,他也参与进去了。",
        "飞盘很好玩,但是小明被飞盘划破了手,不过问题不大贴个创可贴就好了。",
        "第三节课是自然课,老师给同学们介绍了一些昆虫的习性,小明对这个很感兴趣。",
        "第四节课是美术课,老师让大家画自己最想去的地方,小明画了一个被群山围绕的湖泊,他觉得这个地方很美。",
        "中午放学小明和同学们一起去食堂吃了午饭,有他最爱吃的烧鸡。",
        "吃完饭小明有点犯困,于是趴在桌子上睡了个午觉,醒来后感觉精神好多了。",
        "下午有两节体育课,这是小明最喜欢上的课了,老师组织了一场足球赛,小明找准时机射门但射偏了。",
        "上完体育课就放学了,小明跟他的好朋友小王和小美一起回家,并约好了等下一起玩《迷你世界》。",
        "小明回家吃过晚饭就跟朋友们一起去《迷你世界》冒险了,玩了两个小时才发现作业还没做,于是不得不赶紧把作业做完。",
        "快到晚上十点的时候,作业终于写完了,小明去洗漱了一下然后上床睡觉了。"
    ],
    metadatas=[
        {"source": "my_source"}, 
        {"source": "my_source"},
        {"source": "my_source"}, 
        {"source": "my_source"},
        {"source": "my_source"}, 
        {"source": "my_source"},
        {"source": "my_source"}, 
        {"source": "my_source"},
        {"source": "my_source"}, 
        {"source": "my_source"},
        {"source": "my_source"}, 
        {"source": "my_source"},
        {"source": "my_source"}, 
        {"source": "my_source"},
        {"source": "my_source"}, 
        {"source": "my_source"},
        {"source": "my_source"}, 
        {"source": "my_source"},
        {"source": "my_source"}
    ],
    ids=["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8", "id9", "id10", "id11", "id12", "id13", "id14", "id15", "id16", "id17", "id18", "id19"]
)

results = collection.query(query_texts=["小明弄伤了手"], n_results=2)
print(results)

'''
Using embedded DuckDB without persistence: data will be transient
2023-04-24 20:24:41.434 | DEBUG    | text2vec.sentence_model:__init__:74 - Use device: cuda
{'ids': [['id11', 'id16']], 'embeddings': None, 'documents': [['飞盘很好玩,但是小明被飞盘划破了手,不过问题不大贴个创可贴就好了。', '下午有两节体育课,这是小明最喜欢上的课了,老师组织了一场足球赛,小明找准时机射门但射偏了。']], 'metadatas': [[{'source': 'my_source'}, {'source': 'my_source'}]], 'distances': [[183.61741638183594, 300.0113525390625]]}

'''

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

We may need to add a short doc entry similar to "Embeddings/Default: Sentence Transformers", for example:

Chroma provides a text2vec embedding function to create embeddings for Chinese sentences. text2vec is a library for creating sentence and document embeddings. This embedding function runs locally on your machine and may require you to download the model files (this happens automatically).

text2vec_ef = embedding_functions.Text2VecEmbeddingFunction()

By default, text2vec uses the 'shibing624/text2vec-base-chinese' model, which is sufficient for most cases.
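For readers who want a feel for what such an embedding function involves, here is a minimal sketch. It is not the code from this PR: the class name `Text2VecEmbeddingSketch` and the optional `model` parameter are hypothetical, added only so the example can be exercised without downloading a model. It assumes that text2vec exposes `SentenceModel` with an `encode` method that takes a list of sentences.

```python
from typing import Any, List, Optional, Sequence


class Text2VecEmbeddingSketch:
    """Hypothetical stand-in for the proposed Text2VecEmbeddingFunction."""

    def __init__(
        self,
        model_name: str = "shibing624/text2vec-base-chinese",
        model: Optional[Any] = None,  # hypothetical hook: inject a preloaded or fake model
    ) -> None:
        if model is None:
            # Import lazily so text2vec is only required when it is actually used.
            try:
                from text2vec import SentenceModel
            except ImportError as exc:
                raise ImportError(
                    "The text2vec package is not installed. "
                    "Install it with `pip install text2vec`."
                ) from exc
            model = SentenceModel(model_name)
        self._model = model

    def __call__(self, texts: Sequence[str]) -> List[List[float]]:
        # Encode the whole batch, then convert to plain Python floats.
        vectors = self._model.encode(list(texts))
        return [[float(x) for x in vec] for vec in vectors]
```

Calling `Text2VecEmbeddingSketch()` with no arguments would load the default model on first construction, which triggers the automatic download described above.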

rxy1212 avatar Apr 24 '23 12:04 rxy1212

Hi @rxy1212 - I think using the existing SentenceTransformerEmbeddingFunction with model_name='shibing624/text2vec-base-chinese' works here too, and this is already supported. Or does Text2Vec support models not supported by SentenceTransformers that you were hoping to use?

HammadB avatar Apr 24 '23 18:04 HammadB

Hi @HammadB, Text2VecEmbeddingFunction and SentenceTransformerEmbeddingFunction require different packages. The former needs text2vec installed, while the latter needs sentence_transformers, so you can't just use the existing SentenceTransformerEmbeddingFunction with model_name='shibing624/text2vec-base-chinese'. Maybe a more abstract EmbeddingFunction class is needed to cover both, or even more similar EmbeddingFunctions.
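One way the "more abstract EmbeddingFunction" idea could look is sketched below. None of these names exist in Chroma: `Encoder`, `GenericEmbeddingFunction`, and the two loader helpers are hypothetical. The sketch assumes only that both `text2vec.SentenceModel` and `sentence_transformers.SentenceTransformer` offer an `encode` method that takes a list of sentences, and it defers the backend import until the first call so that only the package actually used needs to be installed.

```python
from typing import Callable, List, Optional, Protocol, Sequence


class Encoder(Protocol):
    """Anything with an encode() method that maps sentences to vectors."""

    def encode(self, sentences: List[str]): ...


class GenericEmbeddingFunction:
    """Hypothetical adapter: wraps any Encoder and loads it on first use."""

    def __init__(self, loader: Callable[[], Encoder]) -> None:
        self._loader = loader
        self._encoder: Optional[Encoder] = None

    def __call__(self, texts: Sequence[str]) -> List[List[float]]:
        if self._encoder is None:
            # Deferred so the backend package is imported only when needed.
            self._encoder = self._loader()
        vectors = self._encoder.encode(list(texts))
        return [[float(x) for x in vec] for vec in vectors]


def text2vec_loader(
    model_name: str = "shibing624/text2vec-base-chinese",
) -> Callable[[], Encoder]:
    def load() -> Encoder:
        from text2vec import SentenceModel  # needs: pip install text2vec
        return SentenceModel(model_name)
    return load


def sentence_transformers_loader(model_name: str) -> Callable[[], Encoder]:
    def load() -> Encoder:
        # needs: pip install sentence-transformers
        from sentence_transformers import SentenceTransformer
        return SentenceTransformer(model_name)
    return load
```

With this shape, `GenericEmbeddingFunction(text2vec_loader())` and `GenericEmbeddingFunction(sentence_transformers_loader("some-model"))` would share one code path, and supporting another backend would mean writing one more loader rather than another class.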

rxy1212 avatar Apr 25 '23 01:04 rxy1212

Also, I'd like to know: what is the maximum number of collections I can create in a single client?

rxy1212 avatar Apr 25 '23 12:04 rxy1212

We will re-review and merge!

jeffchuber avatar May 11 '23 17:05 jeffchuber