add caching for embeddings

Open manziman opened this issue 1 year ago • 7 comments

Adds caching for embeddings using the following cache providers:

  • InMemory
  • Redis

Issue: https://github.com/hwchase17/langchain/issues/851

I wasn't sure exactly how this should be added, so I chose to modify the Embeddings interface to include an optional embeddings cache, since the current llm_cache implementation couldn't be used. I thought about refactoring llm_cache into a more general caching layer, but didn't want to make a monster PR for my first contribution.
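
For illustration, here is a minimal sketch of that idea, with made-up class names rather than the actual langchain or PR code: the Embeddings interface carries an optional cache that embed_query consults before calling the provider.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class EmbeddingsCache(ABC):
    """Illustrative cache interface (not the PR's actual classes)."""

    @abstractmethod
    def lookup(self, text: str) -> Optional[List[float]]:
        """Return a cached embedding for `text`, or None on a miss."""

    @abstractmethod
    def update(self, text: str, embedding: List[float]) -> None:
        """Store the embedding for `text`."""


class InMemoryEmbeddingsCache(EmbeddingsCache):
    """Dict-backed cache; a Redis-backed variant would expose the same methods."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}

    def lookup(self, text: str) -> Optional[List[float]]:
        return self._store.get(text)

    def update(self, text: str, embedding: List[float]) -> None:
        self._store[text] = embedding


class Embeddings(ABC):
    """Embeddings interface with an optional, opt-in embeddings cache."""

    cache: Optional[EmbeddingsCache] = None

    @abstractmethod
    def _embed(self, text: str) -> List[float]:
        """Provider-specific embedding call (OpenAI, HuggingFace, ...)."""

    def embed_query(self, text: str) -> List[float]:
        # Check the cache first, fall back to the provider, then populate it.
        if self.cache is not None:
            hit = self.cache.lookup(text)
            if hit is not None:
                return hit
        embedding = self._embed(text)
        if self.cache is not None:
            self.cache.update(text, embedding)
        return embedding
```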

Any input from maintainers is welcome; I am happy to refactor this if it needs to be implemented differently, e.g. more like llm_cache or something else.

manziman avatar Mar 23 '23 15:03 manziman

I followed the contributing guide but for some reason I couldn't get the docs to build locally (or there was no diff?).

manziman avatar Mar 23 '23 15:03 manziman

Random thought: I think embed_query(self, text: str) -> List[float] should work more like the llm_cache and keep a reference to the model being used to embed, so that you don't run into issues if you have multiple embedding models at the same time.
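
To illustrate the concern, here is a rough sketch (hypothetical names, not the PR's code) of a cache keyed by both the model identifier and the text, so two embedding models never collide on the same input:

```python
from typing import Dict, List, Optional, Tuple


class ModelAwareEmbeddingsCache:
    """Illustrative in-memory cache keyed by (model identifier, text)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], List[float]] = {}

    def lookup(self, model_id: str, text: str) -> Optional[List[float]]:
        # Namespacing by model_id keeps embeddings from different models
        # (e.g. two different providers) from overwriting each other.
        return self._store.get((model_id, text))

    def update(self, model_id: str, text: str, embedding: List[float]) -> None:
        self._store[(model_id, text)] = embedding
```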

cptspacemanspiff avatar Mar 25 '23 03:03 cptspacemanspiff

Thank you both for the feedback! I'll work on those changes.

manziman avatar Mar 28 '23 20:03 manziman

see a related pr! https://github.com/hwchase17/langchain/pull/2103

hwchase17 avatar Mar 28 '23 22:03 hwchase17

Just noticed: this applies not only to this PR but also to the current LLM cache implementation. Python's hash() function is (by default) not repeatable between runs; see: https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions

This only seems to be used for the Redis database, but I noticed your implementation uses the same hash logic...

I am not sure what the best replacement is.
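
A quick way to see the problem (this is standard Python behavior, nothing specific to this PR):

```python
# Run in two separate interpreter sessions, e.g.:
#   python -c "print(hash('some text to embed'))"
# Since Python 3.3, str/bytes hashing is salted per process by default
# (controlled by PYTHONHASHSEED), so the two runs will almost certainly
# print different values. Any Redis key derived from hash() therefore
# stops matching its cached entry after the process restarts.
print(hash("some text to embed"))
```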

cptspacemanspiff avatar Mar 29 '23 00:03 cptspacemanspiff

I was going to use hashlib and SHA, but I saw hash() being used in the LLM cache; I had no real reason for choosing it other than that. I didn't realize it had that behavior. I think SHA would be a better choice here, then.
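
A rough sketch of what a hashlib-based key could look like (hypothetical helper name, not the PR's code), which also folds in the model identifier mentioned above:

```python
import hashlib


def embedding_cache_key(model_id: str, text: str) -> str:
    """Deterministic cache key for an embedding.

    hashlib.sha256 gives the same digest in every interpreter session,
    unlike the builtin hash(), and including model_id namespaces the key
    per embedding model.
    """
    digest = hashlib.sha256(f"{model_id}:{text}".encode("utf-8")).hexdigest()
    return f"embedding:{model_id}:{digest}"


# Same inputs always produce the same key, even across process restarts.
assert embedding_cache_key("text-embedding-ada-002", "hello world") == \
       embedding_cache_key("text-embedding-ada-002", "hello world")
```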

manziman avatar Mar 30 '23 00:03 manziman

We fixed the non-deterministic hash issue -- should be up to date on master and in the latest release!

Also -- interested in the caching approach for embeddings. I will take a look here and see what we might be able to do.

tylerhutcherson avatar Apr 29 '23 16:04 tylerhutcherson

Any updates on this?

jxmorris12 avatar Jun 05 '23 13:06 jxmorris12

Embedding caches are now available in langchain: https://python.langchain.com/docs/modules/data_connection/text_embedding/caching_embeddings
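
For anyone landing here later, usage looks roughly like this per the linked docs at the time (import paths and class names may have moved in later releases, and OpenAIEmbeddings needs an API key to actually run):

```python
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore

underlying = OpenAIEmbeddings()
store = LocalFileStore("./embedding_cache/")

# Namespacing by model name avoids collisions between embedding models.
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying, store, namespace=underlying.model
)

# Repeated calls for the same texts are served from the local cache.
vectors = cached_embedder.embed_documents(["hello world", "goodbye world"])
```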

eyurtsev avatar Sep 01 '23 20:09 eyurtsev