[Feature Request] Cache and manage the embeddings in a persistent storage

Open 0x7c13 opened this issue 3 months ago • 2 comments

Context / Scenario

This issue digs deeper into the topic raised in this PR: https://github.com/microsoft/kernel-memory/pull/389

The problem

The problem is simple: we want to call the embedding API as little as possible, since it is often slow and expensive. One quick and cheap solution is to cache embeddings by content hash, so that a hash match (the same text showing up again) reuses the stored embedding when KM is fed a large document, or several documents, with repeated content. That is all the PR above is about.
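To make the idea concrete, here is a minimal Python sketch of caching by content hash within a single ingestion run. It is not the code from the PR: `content_hash`, `embed_with_cache`, and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in for a real embedding API call.

```python
import hashlib
from typing import Callable, Dict, List


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the text, used as the cache key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_cache(
    partitions: List[str],
    embed: Callable[[str], List[float]],
    cache: Dict[str, List[float]],
) -> List[List[float]]:
    """Embed each partition, calling `embed` only on a cache miss."""
    vectors = []
    for text in partitions:
        key = content_hash(text)
        if key not in cache:
            cache[key] = embed(text)  # the slow, expensive call
        vectors.append(cache[key])
    return vectors


# Hypothetical stand-in for a real embedding API; it records every call.
calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text))]


cache: Dict[str, List[float]] = {}
vectors = embed_with_cache(["intro", "body", "intro"], fake_embed, cache)
print(len(calls))  # the repeated "intro" partition is embedded only once
```

Because `cache` is a plain dictionary, it disappears when the ingestion run ends, which is exactly the limitation discussed next.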

BUT, I don't think this is an ideal solution for real world scenarios. Why? Because:

  1. In most cases, repeated text or paragraphs are rare.
  2. The PR above only helps within the scope of the current document ingestion.

Let's skip the first point and go straight to the second:

There are lots of cases where we want to update existing documents, or re-ingest them as their content gets refreshed, whether the source is a text document or a web page. In both cases most of the content remains the same, yet the embeddings are generated again and again, even if you re-import using the same document id. I believe this is a scenario where a persistent embedding cache would improve the speed and reduce the cost of continuously ingested documents.
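The re-ingestion scenario can be illustrated with a small Python sketch in which the cache outlives a single run. Everything here is an assumption made for illustration: SQLite stands in for whatever persistent store would actually be used, and `SqliteEmbeddingCache`, `ingest`, and `fake_embed` are hypothetical names.

```python
import hashlib
import json
import os
import sqlite3
import tempfile
from typing import Callable, List, Optional


class SqliteEmbeddingCache:
    """Hypothetical persistent cache: content hash -> embedding stored as JSON."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings "
            "(hash TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )
        self._conn.commit()

    def get(self, key: str) -> Optional[List[float]]:
        row = self._conn.execute(
            "SELECT vector FROM embeddings WHERE hash = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, key: str, vector: List[float]) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO embeddings (hash, vector) VALUES (?, ?)",
            (key, json.dumps(vector)),
        )
        self._conn.commit()

    def close(self) -> None:
        self._conn.close()


def ingest(
    partitions: List[str], embed: Callable[[str], List[float]], db_path: str
) -> int:
    """Embed the partitions and return how many embedding calls were made."""
    cache = SqliteEmbeddingCache(db_path)
    api_calls = 0
    try:
        for text in partitions:
            key = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if cache.get(key) is None:
                cache.put(key, embed(text))
                api_calls += 1
    finally:
        cache.close()
    return api_calls


def fake_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real embedding API call."""
    return [float(len(text))]


db_path = os.path.join(tempfile.mkdtemp(), "embeddings-cache.db")
first_run = ingest(["intro", "setup", "usage"], fake_embed, db_path)
# The same document is re-imported after only its last paragraph changed.
second_run = ingest(["intro", "setup", "usage v2"], fake_embed, db_path)
print(first_run, second_run)
```

Each call to `ingest` opens the cache from disk, so the second run only pays for the one partition whose text changed.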

Proposed solution

In addition to the FileStorageDb and MemoryDb for the vectors and text, we could have another abstraction plus implementation, an EmbeddingsCacheDb, that can be configured and used by the GenerateEmbeddingsHandler to avoid regenerating embeddings for the same partitioned content over time and across workers. Ideally the content hash is stored in a distributed cache such as Redis and the associated embeddings in blob storage, so the cache works across multiple workers.
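A rough Python sketch of how such an abstraction might be shaped is below. It is not Kernel Memory's API: the `HashIndex` and `BlobStore` interfaces, the in-memory classes that stand in for Redis and blob storage, and the `get_or_create` method are all hypothetical, chosen only to show the hash index and the embedding storage as two separate pieces shared by several workers.

```python
import hashlib
from typing import Callable, Dict, List, Optional, Protocol

Vector = List[float]


class HashIndex(Protocol):
    """Maps a content hash to a blob name (Redis in the proposal)."""

    def get(self, content_hash: str) -> Optional[str]: ...
    def set(self, content_hash: str, blob_name: str) -> None: ...


class BlobStore(Protocol):
    """Stores the embeddings themselves (blob storage in the proposal)."""

    def read(self, name: str) -> Vector: ...
    def write(self, name: str, vector: Vector) -> None: ...


class InMemoryHashIndex:
    """In-memory stand-in for a distributed cache, for illustration only."""

    def __init__(self) -> None:
        self._map: Dict[str, str] = {}

    def get(self, content_hash: str) -> Optional[str]:
        return self._map.get(content_hash)

    def set(self, content_hash: str, blob_name: str) -> None:
        self._map[content_hash] = blob_name


class InMemoryBlobStore:
    """In-memory stand-in for blob storage, for illustration only."""

    def __init__(self) -> None:
        self._blobs: Dict[str, Vector] = {}

    def read(self, name: str) -> Vector:
        return self._blobs[name]

    def write(self, name: str, vector: Vector) -> None:
        self._blobs[name] = vector


class EmbeddingsCacheDb:
    """Hypothetical cache used by an embedding handler before calling the API."""

    def __init__(self, index: HashIndex, blobs: BlobStore) -> None:
        self._index = index
        self._blobs = blobs

    def get_or_create(self, text: str, embed: Callable[[str], Vector]) -> Vector:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        blob_name = self._index.get(key)
        if blob_name is not None:
            return self._blobs.read(blob_name)  # cache hit: no API call
        vector = embed(text)
        blob_name = f"embeddings/{key}.json"
        self._blobs.write(blob_name, vector)  # write the blob before indexing it
        self._index.set(key, blob_name)
        return vector


# Two workers share the same index and blob store.
index, blobs = InMemoryHashIndex(), InMemoryBlobStore()
worker_a = EmbeddingsCacheDb(index, blobs)
worker_b = EmbeddingsCacheDb(index, blobs)

api_calls: List[str] = []


def fake_embed(text: str) -> Vector:
    """Hypothetical stand-in for a real embedding API call."""
    api_calls.append(text)
    return [float(len(text)), 1.0]


from_a = worker_a.get_or_create("shared paragraph", fake_embed)
from_b = worker_b.get_or_create("shared paragraph", fake_embed)
```

Writing the blob before updating the index means a worker that finds the hash can expect the blob to exist. A real implementation would also have to decide what happens when two workers miss on the same hash at the same time.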

We might just need to redesign or update how we store the embeddings so that it is easy to check whether an embedding already exists for a given content hash, and we don't need to store it twice. Ideally this needs only an additional mapping from hash to embedding, or perhaps we include the hash in the entity name itself.
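The "hash in the entity name" variant could look roughly like the Python sketch below, where the existence check on storage is the cache lookup and no separate mapping is needed. The file naming scheme and the helper names (`entity_path`, `get_or_create`, `fake_embed`) are assumptions for illustration, with a local temporary directory standing in for real storage.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def entity_path(root: Path, text: str) -> Path:
    """Derive the storage name from the content hash (hypothetical naming scheme)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return root / f"{digest}.embedding.json"


def get_or_create(
    root: Path, text: str, embed: Callable[[str], List[float]]
) -> List[float]:
    """Reuse the stored embedding if an entity with this hash already exists."""
    path = entity_path(root, text)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    vector = embed(text)
    path.write_text(json.dumps(vector), encoding="utf-8")
    return vector


root = Path(tempfile.mkdtemp())
calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real embedding API call."""
    calls.append(text)
    return [0.5, float(len(text))]


first = get_or_create(root, "same paragraph", fake_embed)
second = get_or_create(root, "same paragraph", fake_embed)
stored_files = sorted(p.name for p in root.iterdir())
```

Identical content always maps to the same name, so the embedding is stored once no matter how many documents contain it.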

Users should be able to:

  • Customize the storage type and location of this cache.
  • Control the behavior of this cache through config (a maximum storage limit, etc.).
  • Invalidate the cache by policy (e.g. all cached embeddings associated with a document should be removed when that document is deleted by a specified document id or index).
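As a sketch of the last two bullets, the Python snippet below tracks which documents reference each cached embedding, removes entries once their last owning document is deleted, and enforces a configurable size limit. The names (`CacheConfig`, `ManagedEmbeddingCache`, `invalidate_document`) are hypothetical, and evicting the least recently used entry is an assumed policy that the request above does not specify.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class CacheConfig:
    """Hypothetical config: the maximum number of cached embeddings."""

    max_entries: int = 1000


class ManagedEmbeddingCache:
    def __init__(self, config: CacheConfig) -> None:
        self._config = config
        self._entries: "OrderedDict[str, List[float]]" = OrderedDict()
        self._owners: Dict[str, Set[str]] = {}  # content hash -> document ids

    def put(self, document_id: str, content_hash: str, vector: List[float]) -> None:
        self._entries[content_hash] = vector
        self._entries.move_to_end(content_hash)
        self._owners.setdefault(content_hash, set()).add(document_id)
        # Assumed policy: evict least recently used entries when over the limit.
        while len(self._entries) > self._config.max_entries:
            evicted, _ = self._entries.popitem(last=False)
            self._owners.pop(evicted, None)

    def get(self, content_hash: str) -> Optional[List[float]]:
        vector = self._entries.get(content_hash)
        if vector is not None:
            self._entries.move_to_end(content_hash)  # mark as recently used
        return vector

    def invalidate_document(self, document_id: str) -> int:
        """Drop entries no longer referenced by any document; return the count."""
        removed = 0
        for content_hash in list(self._owners):
            owners = self._owners[content_hash]
            owners.discard(document_id)
            if not owners:
                del self._owners[content_hash]
                del self._entries[content_hash]
                removed += 1
        return removed


cache = ManagedEmbeddingCache(CacheConfig(max_entries=3))
cache.put("doc-a", "h1", [1.0])
cache.put("doc-a", "h2", [2.0])
cache.put("doc-b", "h2", [2.0])  # the same content also appears in doc-b

removed = cache.invalidate_document("doc-a")    # only h1 loses its last owner
shared_survives = cache.get("h2") is not None   # h2 is still owned by doc-b

for n in (3, 4, 5):                             # exceed the limit of 3 entries
    cache.put("doc-b", f"h{n}", [float(n)])
```

Tracking owners per hash keeps an embedding shared by two documents alive until both are deleted, which matters once the cache deduplicates across documents.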

Importance

would be great to have

0x7c13, Apr 01 '24 22:04