langchain
add caching for embeddings
Adds caching for embeddings using the following cache providers:
- InMemory
- Redis
Issue: https://github.com/hwchase17/langchain/issues/851
I wasn't sure exactly how this should be added, so I chose to modify the Embeddings interface to include an optional embeddings cache, since the current llm_cache implementation couldn't be reused. I thought about refactoring llm_cache into a more general caching layer, but I didn't want to make a monster PR for my first contribution.
Any input from maintainers is welcome. I'm happy to refactor this if it needs to be implemented differently, like llm_cache or something else.
I followed the contributing guide, but for some reason I couldn't get the docs to build locally (or there was no diff?).
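To make the idea concrete, here is a minimal sketch of the approach described above: an embeddings call that consults an optional in-memory cache before computing. All names here (`InMemoryEmbeddingsCache`, `embed_query_with_cache`) are hypothetical illustrations, not the PR's actual code.

```python
from typing import Callable, Dict, List, Optional


class InMemoryEmbeddingsCache:
    """Hypothetical in-memory cache mapping input text -> embedding vector."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}

    def lookup(self, text: str) -> Optional[List[float]]:
        return self._store.get(text)

    def update(self, text: str, embedding: List[float]) -> None:
        self._store[text] = embedding


def embed_query_with_cache(
    text: str,
    embed_fn: Callable[[str], List[float]],
    cache: Optional[InMemoryEmbeddingsCache] = None,
) -> List[float]:
    """Return a cached embedding if present; otherwise compute and store it."""
    if cache is not None:
        hit = cache.lookup(text)
        if hit is not None:
            return hit
    embedding = embed_fn(text)
    if cache is not None:
        cache.update(text, embedding)
    return embedding
```

A Redis-backed provider would expose the same `lookup`/`update` surface but persist the vectors, so repeated `embed_query` calls skip the (paid) embedding API entirely.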
Random thought: I think `embed_query(self, text: str) -> List[float]` should be more like the llm_cache and carry a reference to the model being used to embed, so that you don't run into issues if you have multiple embedding models at the same time.
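The collision the comment above is worried about can be avoided by keying the cache on (model identifier, text) instead of text alone. A hypothetical sketch (not the PR's code):

```python
from typing import Dict, List, Optional, Tuple


class ModelAwareEmbeddingsCache:
    """Hypothetical cache keyed by (model, text), so two embedding models
    never return each other's vectors for the same input string."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], List[float]] = {}

    def lookup(self, model: str, text: str) -> Optional[List[float]]:
        return self._store.get((model, text))

    def update(self, model: str, text: str, embedding: List[float]) -> None:
        self._store[(model, text)] = embedding
```

Without the model component in the key, switching from one embedding model to another would silently serve stale vectors from the first model.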
Thank you both for the feedback! I'll work on those changes.
see a related pr! https://github.com/hwchase17/langchain/pull/2103
Just noticed: this applies here but also to the current LLM cache implementation. The hash() function in Python is (by default) not repeatable between runs. See: https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions
This only seems to be used for the Redis database, but I noticed your implementation uses the same hash logic...
I'm not sure what the best replacement is.
I was going to use hashlib and SHA, but I saw hash() being used in the llm cache; I had no real reason for choosing it other than that. I didn't realize it had that behavior. I think SHA would be a better choice here, then.
We fixed the non-deterministic hash issue -- should be up to date on master and in the latest release!
Also -- interested in the caching approach for embeddings. I will take a look here and see what we might be able to do.
Any updates on this?
Embedding caches are now available in langchain: https://python.langchain.com/docs/modules/data_connection/text_embedding/caching_embeddings