Create new VectorStore collections whenever the embedder changes

pieroit opened this issue 2 years ago

If there is an embedder change, the VectorStore will not be compatible because:

  • the new embedder may have a different output dimensionality
  • even if the dimensionality is the same, it is a totally different space

So whenever the embedder changes:

  • delete the old VectorStore collections
  • create new ones (this is already done lazily and works perfectly if there are no VectorStore collections on disk)

Or, if we want to preserve old memories, prepend the embedder name to the collection name.
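A minimal sketch of that naming idea (the helper and collection names are illustrative, not the project's actual code):

```python
# Hypothetical helper: tie each collection's name to the embedder that
# produced its vectors, so switching embedder creates a fresh collection
# instead of writing incompatible vectors into the old one.
def collection_name_for(base_name: str, embedder) -> str:
    return f"{type(embedder).__name__}_{base_name}"

# e.g. "OpenAIEmbeddings_declarative" vs "CohereEmbeddings_declarative"
```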

pieroit avatar Mar 24 '23 14:03 pieroit

Partial fix: the cat itself advises the user in the chat about a change in dimensionality for the embedder and invites the user to delete the web/long_term_memory folder. After reintroducing Qdrant we can find a more elegant solution. Leaving the issue open.

pieroit avatar Apr 05 '23 19:04 pieroit

I could work on this issue.

nickprock avatar May 21 '23 19:05 nickprock

@nickprock changing the embedder at runtime is deactivated at the moment.

To reactivate it, uncomment here: https://github.com/pieroit/cheshire-cat/blob/main/core/cat/routes/setting/embedder_setting.py#L54

Also, to change the embedder from the admin, reactivate this button (remove the disabled attribute): https://github.com/pieroit/cheshire-cat/blob/main/admin/src/views/SettingsView.vue#L44

You should see that a change of embedder results in an error (or the cat says that embedders have different dimensionality)

Let me know if this is clear enough, otherwise I'll give more references

pieroit avatar May 21 '23 20:05 pieroit

Thanks @pieroit, I was able to recreate the error it was giving me before the changes.

{"error":"Wrong input: Vector inserting error: expected dim: 1536, got 4096"}

Starting tonight I will get on it, I already have some ideas.

nickprock avatar May 22 '23 06:05 nickprock

> Thanks @pieroit, I was able to recreate the error it was giving me before the changes.
>
> {"error":"Wrong input: Vector inserting error: expected dim: 1536, got 4096"}

Great!

> Starting tonight I will get on it, I already have some ideas.

I'm glad for your help. Let me know how we can fix this! The easiest path I see is to just wipe out the old collections and recreate them (that is done automagically when you bootstrap the cat). It's an easy solution but a little violent: all memories get lost on an embedder change. Dunno
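A rough sketch of that "violent" path with the Qdrant client (the collection name, host, and vector size here are assumptions for illustration, not the Cat's actual values):

```python
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(host="localhost", port=6333)

# Drop the collection built with the old embedder and recreate it with the
# new embedder's vector size; all stored memories in it are lost.
client.recreate_collection(
    collection_name="declarative",
    vectors_config=models.VectorParams(size=4096, distance=models.Distance.COSINE),
)
```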

pieroit avatar May 22 '23 10:05 pieroit

Hi @pieroit, I tried to split the problem:

  1. configure the embedding dimension dynamically
  2. create new VectorStore collections whenever the embedder changes

At the moment, for point 1, if in vector_memory.py I print the embedder:

from typing import Any, Callable

from langchain.vectorstores import Qdrant


class VectorMemoryCollection(Qdrant):
    def __init__(self, cat, client: Any, collection_name: str, embedding_function: Callable):
        super().__init__(client, collection_name, embedding_function)

        # Keep a reference to the Cat instance so its embedder can be inspected
        self.cat = cat
        print("print embedder ", self.cat.embedder, "\n")

If I use HuggingFace Hub it doesn't print anything. If I use the default Cohere embeddings, I can't swap it for another Cohere embedding model.
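For point 1, a hedged sketch of one way to discover the dimension dynamically instead of hard-coding it per model (not the project's current code; `get_embedding_size` is a hypothetical helper):

```python
from langchain.embeddings.base import Embeddings


def get_embedding_size(embedder: Embeddings) -> int:
    # Embed a short probe string and measure the vector length; this works
    # for any embedder that implements embed_query, whatever its model.
    return len(embedder.embed_query("hello world"))
```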

nickprock avatar May 23 '23 08:05 nickprock

@pieroit a little (maybe stupid) idea: what if we train a dimensionality reduction model like UMAP on large datasets and use it to compress the embeddings from different embedders to the same size?
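A minimal sketch of that idea, assuming the umap-learn package (the corpus, sizes, and variable names are illustrative only):

```python
import numpy as np
import umap

# Stand-in for embeddings produced by the new embedder, e.g. 4096-dimensional.
reference_embeddings = np.random.rand(1_000, 4096)

# Fit a reducer that projects them down to a shared target size.
reducer = umap.UMAP(n_components=256)
reducer.fit(reference_embeddings)

# Any new embedding can then be compressed into the shared space.
compressed = reducer.transform(np.random.rand(5, 4096))  # shape (5, 256)
```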

I need to enable the embedder config in the app. In the old version I enabled https://github.com/pieroit/cheshire-cat/blob/main/admin/src/views/SettingsView.vue#L44, but how do I do it now?

nickprock avatar May 30 '23 08:05 nickprock

These are placeholders:

  • https://stackoverflow.com/questions/60290296/word2vec-compare-vectors-from-different-models-with-different-sizes
  • https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9
  • https://www.philschmid.de/optimize-sentence-transformers

nickprock avatar Jun 08 '23 07:06 nickprock

A change of embedder is the admin's responsibility; it's going to be really hard for us to make it feasible. Let's just dump the current collection on disk as a backup, and create a new collection for the new embedder.

In this way, if people make a wrong change there is still a possibility to recover the old collection.

A local embedder (in the same or a different container) based on sentence-transformers sounds like a good default, if we manage to make it lightweight.
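A sketch of that backup-then-recreate flow with the Qdrant client (collection name, host, and vector size are assumptions; Qdrant writes snapshots into its storage directory on disk):

```python
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(host="localhost", port=6333)

# Dump the current collection to disk so it can be recovered later...
client.create_snapshot(collection_name="declarative")

# ...then recreate the collection sized for the new embedder.
client.recreate_collection(
    collection_name="declarative",
    vectors_config=models.VectorParams(size=4096, distance=models.Distance.COSINE),
)
```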

pieroit avatar Jun 12 '23 10:06 pieroit

@pieroit another placeholder: https://python.langchain.com/en/latest/modules/models/text_embedding/examples/self-hosted.html

Now the situation is clearer to me.

nickprock avatar Jun 16 '23 05:06 nickprock

Yes @nickprock, it should be feasible to subclass Embeddings.

Checking out ONNX Runtime right now.
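A minimal sketch of such a subclass backed by a local sentence-transformers model (the class name and model name are just examples, not the Cat's actual implementation):

```python
from typing import List

from langchain.embeddings.base import Embeddings
from sentence_transformers import SentenceTransformer


class LocalSentenceTransformerEmbeddings(Embeddings):
    """Local embedder wrapping a sentence-transformers model."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Encode a batch of documents into lists of floats.
        return self.model.encode(texts).tolist()

    def embed_query(self, text: str) -> List[float]:
        # Encode a single query string.
        return self.model.encode(text).tolist()
```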

pieroit avatar Jun 16 '23 09:06 pieroit

I have added Qdrant aliases in the last PR. Currently they tag collections to avoid mixing up two embedders with the same size, but you can think of other ways to use them in the future.
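For reference, the alias mechanism in the Qdrant client looks roughly like this (the collection and alias names here are only illustrative, not the ones used in the PR):

```python
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(host="localhost", port=6333)

# Point a stable alias at the collection created for the current embedder,
# so two embedders with the same vector size cannot be confused.
client.update_collection_aliases(
    change_aliases_operations=[
        models.CreateAliasOperation(
            create_alias=models.CreateAlias(
                collection_name="declarative_openai_1536",
                alias_name="declarative",
            )
        )
    ]
)
```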

nickprock avatar Jun 19 '23 07:06 nickprock