
[Feature Request]: Multiple embeddings per document

Open eebmagic opened this issue 2 years ago • 6 comments

Describe the problem

Is there a way to use multiple embeddings functions?

My understanding is that each collection has a single embedding function. I would like to be able to have several and then specify which one to use when querying.

Is there already some way to do this with multiple collections? If so, ideally I would be able to have one collection with NO embeddings and JUST documents and ids, and multiple other collections with NO documents and JUST embeddings and ids.

Describe the proposed solution

I should be able to create a collection with multiple embedding functions. Ideally I should be able to add more functions later as well.

OR

There should be a clear way to create a collection with NO embeddings, and then a way to create a collection that sources its docs from another collection.

Alternatives considered

No response

Importance

would make my life easier

Additional Information

No response

eebmagic avatar Nov 02 '23 16:11 eebmagic

@eebmagic can you tell me more about your use case? and the motivations behind this? we have thought about this but currently don't have a super clean way of doing this internally.

jeffchuber avatar Nov 03 '23 21:11 jeffchuber

@jeffchuber I'd like to be able to compare embedding models across one large set of docs. This would also be nice if I had specialized models and wanted to query different aspects of documents with those embeddings, without having to manage parallel collections.

I'm assuming that creating several collections with the same docs but different embedding functions would mean redundantly storing the docs across all N collections (and would also be a bit more work to manage N collections), which would require a lot more storage than I can afford given my document set size.

I think the ideal setup for me would be having one collection with the docs and multiple embeddings for each doc and then when I run a query I would just specify which embedding to use. I think Milvus lets you do this, but I was having other problems with Milvus.

eebmagic avatar Nov 03 '23 21:11 eebmagic

@eebmagic currently our recommended path here is to write a small util to manage the "duplicated" index. should be pretty easy - ofc a bit less efficient to store the metadata and document twice, but I think the simplicity and making it "easy to reason about" is worth it
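The small util suggested here could be as simple as a write-through helper that fans every add out to all the collections. A minimal sketch (the helper name `add_everywhere` is mine, not part of Chroma; it works with any objects exposing Chroma's `Collection.add(ids=..., documents=..., metadatas=...)` signature):

```python
def add_everywhere(collections, ids, documents, metadatas=None):
    """Add the same ids/documents to every collection.

    Each collection embeds the documents with its own embedding
    function, so you end up with one index per model while all
    indexes share the same ids. Documents and metadata are stored
    redundantly in each collection, as discussed above.
    """
    for collection in collections:
        collection.add(ids=ids, documents=documents, metadatas=metadatas)
```

As long as all writes go through this helper, the N collections stay in sync and a doc id means the same thing in each of them.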

jeffchuber avatar Nov 03 '23 22:11 jeffchuber

So if, for example, you want to parse a directory of documents and define multiple collections because you want different embeddings, do you need to add the documents separately to each collection with its .add_documents method?

But if so, each collection will have its own doc_id for the same retrieved document, so how will you be able to combine them together?

namp avatar Nov 18 '23 14:11 namp

@namp I'm imagining that if the solution were to require multiple collections, then I should be able to create the second collection as a "bound" or "slave" collection to the first, which already has all the docs. Then ideally, whenever the doc set changes in the first collection, the embeddings would be rebuilt in both collections.

I don't think doc ids have to be unique across collections, so you could have a single id for a doc, the actual doc text stored in a single collection, and multiple embeddings associated with that doc id in separate collections.

The major point is to avoid redundant storing of a large set of texts.

eebmagic avatar Dec 05 '23 20:12 eebmagic

I have a use case. I have used OpenAI as the embedding model for my current database setup, but I would like to start migrating to a different provider to cut down on costs. That would mean I have to backfill my DB once again with different embeddings. Having both versions ready in production would help cut costs on the more heavily used parts of my app, while retaining the richer output for other parts. Having both versions available in one document would leave me with a much cleaner database and save me the time of using different database routes for specific use cases (I imagine this might get messy real quick).

Jkense avatar May 23 '24 11:05 Jkense

@namp @eebmagic @Jkense while we don't currently support this, there are a couple of workarounds to try.

  1. Have a 'master' collection which acts as the document store, with original embeddings. 'slave' collections can then contain just the new embeddings, and metadata which references the document ids in the master collection. When querying a 'slave' collection, after results are retrieved, extract ids from the result's metadata, and call get(ids=[...]) on the master collection.

This approach avoids redundant storage of documents, and lets you link them in this fashion, though it requires some additional logic.

  2. Write a DataLoader for text like the one we have for images (we would be interested in this contribution). Both collections can then just refer to the same uri. In this setup you'd be maintaining your own document storage.

Neither is perfect but we don't currently have specific plans to implement this feature, though we may in future.
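The first workaround can be sketched in a few lines. This assumes a naming convention in which each record in the 'slave' collection stores its master document id under a `master_id` metadata key (the key name and the `query_with_master_docs` helper are my own convention, not a Chroma built-in); the `query`/`get` calls and result shapes follow Chroma's collection API:

```python
def query_with_master_docs(slave, master, query_text, n_results=5):
    """Query a 'slave' collection (alternate embeddings only), then
    fetch the actual documents from the 'master' collection.

    Chroma's query() returns parallel lists-of-lists, one inner list
    per query text; we only send one query, so we read index [0].
    """
    res = slave.query(query_texts=[query_text], n_results=n_results)
    # Pull the master ids out of the slave results' metadata.
    master_ids = [m["master_id"] for m in res["metadatas"][0]]
    # Fetch the documents themselves from the master collection.
    docs = master.get(ids=master_ids)
    return master_ids, docs
```

Note that `master_ids` preserves the similarity ranking from the slave query, while the `get()` result may come back in a different order, so keep both if ranking matters.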

atroyn avatar May 28 '24 21:05 atroyn

Use case: chunks tend to be longer, but include summaries.

As a user, I would like to be able to find chunks by vector similarity on either (or both) of the summary and the raw content itself.

Motivation: As small LLMs become better, they will become more and more a part of the data-ingest process. It's likely that in the future chunk sizes will grow, and we will need semantic search both on the original chunk content and on the summary, letting the LLM further down in the RAG process pick which part of the chunk to include in the context window.

Food for thought: While this is achievable with multiple collections, is it really desired?

baughmann avatar Jul 14 '24 03:07 baughmann

> @namp @eebmagic @Jkense while we don't currently support this, there are a couple of workarounds to try.
>
> 1. Have a 'master' collection which acts as the document store, with original embeddings. 'slave' collections can then contain just the new embeddings, and metadata which references the document ids in the master collection. When querying a 'slave' collection, after results are retrieved, extract ids from the result's metadata, and call get(ids=[...]) on the master collection.
>
> This approach avoids redundant storage of documents, and lets you link them in this fashion, though it requires some additional logic.
>
> 2. Write a DataLoader for text like the one we have for images (we would be interested in this contribution). Both collections can then just refer to the same uri. In this setup you'd be maintaining your own document storage.
>
> Neither is perfect but we don't currently have specific plans to implement this feature, though we may in future.

When I do this and call get(ids=[...]), I get the results back, but they are sorted by id in ascending order rather than in the order of the ids I passed in. How can I work around this? It is pivotal, since the top-n results should stay at the top of the list.
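Since `get()` does not promise to preserve the request order, one option is to re-sort its parallel result lists yourself using the ranked ids from the query. A minimal sketch in plain Python (the helper name `reorder_get_result` is mine; it assumes the usual Chroma result shape of a dict of parallel lists keyed by the ids list):

```python
def reorder_get_result(result, ordered_ids):
    """Re-sort a get()-style result dict so its parallel lists follow
    `ordered_ids` (e.g. the ranked ids from a query) instead of the
    ascending-id order the store returned them in."""
    # Map each returned id to its position in the result lists.
    pos = {doc_id: i for i, doc_id in enumerate(result["ids"])}
    order = [pos[doc_id] for doc_id in ordered_ids]
    # Re-index every parallel list; pass non-list values through.
    return {
        key: [values[i] for i in order] if isinstance(values, list) else values
        for key, values in result.items()
    }

# Example: get() came back sorted as ["a", "b", "c"], but the query
# ranked them ["c", "a", "b"].
fetched = {"ids": ["a", "b", "c"], "documents": ["A", "B", "C"]}
ranked = reorder_get_result(fetched, ["c", "a", "b"])
# ranked["documents"] is now ["C", "A", "B"]
```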

Jeriousman avatar Jan 01 '25 13:01 Jeriousman