chroma
Add Text2VecEmbeddingFunction
Add Text2VecEmbeddingFunction for better Chinese sentence embedding.
Description of changes
Summarize the changes made by this PR.
- New functionality
- Adds a new embedding function that uses the text2vec model for Chinese sentence embeddings.
Test plan
How are these changes tested?
import chromadb
from chromadb.utils import embedding_functions
chroma_client = chromadb.Client()
text2vec_ef = embedding_functions.Text2VecEmbeddingFunction()
collection = chroma_client.create_collection(name="test", embedding_function=text2vec_ef)
collection.add(
documents=[
"小明被窗外枝头上小鸟的叫声吵醒了,迷迷糊糊的睁开了眼睛。",
"小明从床上爬起来,走向洗手间开始洗漱。",
"小明从厨房冰箱里拿了盒牛奶准备泡燕麦当早餐,但他发现燕麦已经没有了。",
"小明又从冰箱里拿了两片面包作为早餐。",
"小明准备出门去校车停靠点,在路上遇到了也要去上学的小王和小美。",
"小明在校车上跟小伙伴们聊到了昨天晚上他在《迷你世界》里玩的很开心。",
"小明到达学校,跟班上的同学们一起做了早操。",
"第一节课是语文课,今天学习古诗《枫桥夜泊》。",
"第二节课是数学课,老师讲了关于二元一次方程的解法,小明感觉自己没怎么搞清楚。",
"课间休息的时候小明看到有人在玩飞盘,他也参与进去了。",
"飞盘很好玩,但是小明被飞盘划破了手,不过问题不大贴个创可贴就好了。",
"第三节课是自然课,老师给同学们介绍了一些昆虫的习性,小明对这个很感兴趣。",
"第四节课是美术课,老师让大家画自己最想去的地方,小明画了一个被群山围绕的湖泊,他觉得这个地方很美。",
"中午放学小明和同学们一起去食堂吃了午饭,有他最爱吃的烧鸡。",
"吃完饭小明有点犯困,于是趴在桌子上睡了个午觉,醒来后感觉精神好多了。",
"下午有两节体育课,这是小明最喜欢上的课了,老师组织了一场足球赛,小明找准时机射门但射偏了。",
"上完体育课就放学了,小明跟他的好朋友小王和小美一起回家,并约好了等下一起玩《迷你世界》。",
"小明回家吃过晚饭就跟朋友们一起去《迷你世界》冒险了,玩了两个小时才发现作业还没做,于是不得不赶紧把作业做完。",
"快到晚上十点的时候,作业终于写完了,小明去洗漱了一下然后上床睡觉了。"
],
metadatas=[{"source": "my_source"} for _ in range(19)],
ids=[f"id{i}" for i in range(1, 20)]
)
# Query text: "Xiao Ming hurt his hand"
results = collection.query(query_texts=["小明弄伤了手"], n_results=2)
print(results)
'''
Using embedded DuckDB without persistence: data will be transient
2023-04-24 20:24:41.434 | DEBUG | text2vec.sentence_model:__init__:74 - Use device: cuda
{'ids': [['id11', 'id16']], 'embeddings': None, 'documents': [['飞盘很好玩,但是小明被飞盘划破了手,不过问题不大贴个创可贴就好了。', '下午有两节体育课,这是小明最喜欢上的课了,老师组织了一场足球赛,小明找准时机射门但射偏了。']], 'metadatas': [[{'source': 'my_source'}, {'source': 'my_source'}]], 'distances': [[183.61741638183594, 300.0113525390625]]}
'''
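To make the output above easier to read, here is a minimal sketch that pairs each returned id with its distance. The `results` dict is trimmed to the `ids` and `distances` values copied from the printed output; `ranked_hits` is a hypothetical helper, not part of chromadb, and it assumes that a smaller distance means a closer match.

```python
# Values copied from the printed query output; other keys omitted for brevity.
results = {
    "ids": [["id11", "id16"]],
    "distances": [[183.61741638183594, 300.0113525390625]],
}


def ranked_hits(results, query_index=0):
    """Hypothetical helper: pair ids with distances for one query, closest first."""
    ids = results["ids"][query_index]
    distances = results["distances"][query_index]
    # Assumption: a smaller distance means a closer match.
    return sorted(zip(ids, distances), key=lambda pair: pair[1])


for doc_id, distance in ranked_hits(results):
    print(f"{doc_id}: {distance:.2f}")
```

Under that assumption, id11 (the sentence about the frisbee cutting Xiao Ming's hand) ranks first, which is the expected match for the query.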
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?
We may need to add a short doc page similar to Embeddings/Default: Sentence Transformers, e.g.:
Chroma provides a text2vec embedding function to create embeddings for Chinese sentences. text2vec is a library for creating sentence and document embeddings. This embedding function runs locally on your machine and may require you to download the model files (this will happen automatically).
text2vec_ef = embedding_functions.Text2VecEmbeddingFunction()
By default, text2vec uses the 'shibing624/text2vec-base-chinese' model, which is sufficient for most cases.
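For readers who want a feel for what such an embedding function involves, here is a rough sketch, not the code in this PR. The class name `Text2VecEmbeddingSketch` and the `encoder` parameter are hypothetical and exist only so the example can run without downloading a model. The use of `text2vec.SentenceModel` and its `encode` method is an assumption about the text2vec package.

```python
from typing import Callable, List, Optional, Sequence

Encoder = Callable[[List[str]], Sequence[Sequence[float]]]


class Text2VecEmbeddingSketch:
    """Hypothetical sketch of a callable that maps texts to embedding vectors."""

    def __init__(
        self,
        model_name: str = "shibing624/text2vec-base-chinese",
        encoder: Optional[Encoder] = None,  # hypothetical hook for testing
    ):
        if encoder is None:
            try:
                # Assumption: text2vec exposes SentenceModel with an encode() method.
                from text2vec import SentenceModel
            except ImportError as exc:
                raise ValueError(
                    "The text2vec package is not installed; try `pip install text2vec`."
                ) from exc
            model = SentenceModel(model_name_or_path=model_name)

            def encoder(texts: List[str]) -> Sequence[Sequence[float]]:
                # encode() is assumed to return a NumPy array of shape (n, dim).
                return model.encode(texts).tolist()

        self._encode = encoder

    def __call__(self, texts: Sequence[str]) -> List[List[float]]:
        # Return plain lists of floats, one vector per input text.
        return [[float(x) for x in vec] for vec in self._encode(list(texts))]


# Usage with a stand-in encoder (vector = [text length]); no model download needed.
fake_ef = Text2VecEmbeddingSketch(encoder=lambda texts: [[len(t)] for t in texts])
print(fake_ef(["小明", "早餐吃面包"]))
```

Importing text2vec inside `__init__` keeps it an optional dependency, so users who never construct this embedding function do not need the package installed.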
Hi @rxy1212 - I think using the existing SentenceTransformerEmbeddingFunction with model_name='shibing624/text2vec-base-chinese' works here too, and this is already supported. Or does Text2Vec support models not supported by SentenceTransformers that you were hoping to use?
Hi @HammadB, Text2VecEmbeddingFunction and SentenceTransformerEmbeddingFunction require different packages. The former needs text2vec installed, while the latter needs sentence_transformers, so you can't simply use the existing SentenceTransformerEmbeddingFunction with model_name='shibing624/text2vec-base-chinese'. Maybe a more abstract EmbeddingFunction class is needed to cover both, or even more similar embedding functions.
Also, I'd like to know: what is the maximum number of collections I can create in a single client?
We will re-review and merge!