add caching for embeddings

Open manziman opened this issue 1 year ago • 7 comments

Adds caching for embeddings using the following cache providers:

  • InMemory
  • Redis

Issue: https://github.com/hwchase17/langchain/issues/851

I wasn't sure exactly how this should be added, so I chose to modify the Embeddings interface to include an optional embeddings cache, since the current llm_cache implementation couldn't be used. I thought about refactoring llm_cache into a more general caching layer, but didn't want to make a monster PR for my first contribution.
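
For illustration, here is a minimal sketch of that idea, with made-up class names rather than the actual langchain or PR code: the Embeddings interface carries an optional cache that embed_query consults before calling the provider.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class EmbeddingsCache(ABC):
    """Illustrative cache interface (not the PR's actual classes)."""

    @abstractmethod
    def lookup(self, text: str) -> Optional[List[float]]:
        """Return a cached embedding for `text`, or None on a miss."""

    @abstractmethod
    def update(self, text: str, embedding: List[float]) -> None:
        """Store the embedding for `text`."""


class InMemoryEmbeddingsCache(EmbeddingsCache):
    """Dict-backed cache; a Redis-backed variant would expose the same methods."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}

    def lookup(self, text: str) -> Optional[List[float]]:
        return self._store.get(text)

    def update(self, text: str, embedding: List[float]) -> None:
        self._store[text] = embedding


class Embeddings(ABC):
    """Embeddings interface with an optional, opt-in embeddings cache."""

    cache: Optional[EmbeddingsCache] = None

    @abstractmethod
    def _embed(self, text: str) -> List[float]:
        """Provider-specific embedding call (OpenAI, HuggingFace, ...)."""

    def embed_query(self, text: str) -> List[float]:
        # Check the cache first, fall back to the provider, then populate it.
        if self.cache is not None:
            hit = self.cache.lookup(text)
            if hit is not None:
                return hit
        embedding = self._embed(text)
        if self.cache is not None:
            self.cache.update(text, embedding)
        return embedding
```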

Any input from maintainers is welcome; I am happy to refactor this if it needs to be implemented differently, e.g. more like llm_cache or something else.

manziman avatar Mar 23 '23 15:03 manziman

I followed the contributing guide but for some reason I couldn't get the docs to build locally (or there was no diff?).

manziman avatar Mar 23 '23 15:03 manziman

Random thought: I think embed_query(self, text: str) -> List[float] should work more like the llm_cache and keep a reference to the model being used to embed, so that you don't run into issues if you have multiple embedding models at the same time.
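
To illustrate the concern, here is a rough sketch (hypothetical names, not the PR's code) of a cache keyed by both the model identifier and the text, so two embedding models never collide on the same input:

```python
from typing import Dict, List, Optional, Tuple


class ModelAwareEmbeddingsCache:
    """Illustrative in-memory cache keyed by (model identifier, text)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], List[float]] = {}

    def lookup(self, model_id: str, text: str) -> Optional[List[float]]:
        # Namespacing by model_id keeps embeddings from different models
        # (e.g. two different providers) from overwriting each other.
        return self._store.get((model_id, text))

    def update(self, model_id: str, text: str, embedding: List[float]) -> None:
        self._store[(model_id, text)] = embedding
```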

cptspacemanspiff avatar Mar 25 '23 03:03 cptspacemanspiff

Thank you both for the feedback! I'll work on those changes.

manziman avatar Mar 28 '23 20:03 manziman

see a related pr! https://github.com/hwchase17/langchain/pull/2103

hwchase17 avatar Mar 28 '23 22:03 hwchase17

Just noticed: this applies not only to this PR but also to the current LLM cache implementation. Python's hash() function is (by default) not repeatable between runs; see: https://stackoverflow.com/questions/27522626/hash-function-in-python-3-3-returns-different-results-between-sessions

This only seems to be used for the Redis database, but I noticed your implementation uses the same hash logic...

I am not sure what the best replacement is.
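
A quick way to see the problem (this is standard Python behavior, nothing specific to this PR):

```python
# Run in two separate interpreter sessions, e.g.:
#   python -c "print(hash('some text to embed'))"
# Since Python 3.3, str/bytes hashing is salted per process by default
# (controlled by PYTHONHASHSEED), so the two runs will almost certainly
# print different values. Any Redis key derived from hash() therefore
# stops matching its cached entry after the process restarts.
print(hash("some text to embed"))
```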

cptspacemanspiff avatar Mar 29 '23 00:03 cptspacemanspiff

I was going to use hashlib and SHA, but I saw hash() being used in the LLM cache; I had no real reason for choosing it other than that. I didn't realize it had that behavior. I think SHA would be a better choice here, then.
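
A rough sketch of what a hashlib-based key could look like (hypothetical helper name, not the PR's code), which also folds in the model identifier mentioned above:

```python
import hashlib


def embedding_cache_key(model_id: str, text: str) -> str:
    """Deterministic cache key for an embedding.

    hashlib.sha256 gives the same digest in every interpreter session,
    unlike the builtin hash(), and including model_id namespaces the key
    per embedding model.
    """
    digest = hashlib.sha256(f"{model_id}:{text}".encode("utf-8")).hexdigest()
    return f"embedding:{model_id}:{digest}"


# Same inputs always produce the same key, even across process restarts.
assert embedding_cache_key("text-embedding-ada-002", "hello world") == \
       embedding_cache_key("text-embedding-ada-002", "hello world")
```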

manziman avatar Mar 30 '23 00:03 manziman

We fixed the non-deterministic hash issue -- should be up to date on master and in the latest release!

Also -- interested in the caching approach for embeddings. I will take a look here and see what we might be able to do.

tylerhutcherson avatar Apr 29 '23 16:04 tylerhutcherson

Any updates on this?

jxmorris12 avatar Jun 05 '23 13:06 jxmorris12

Embedding caches are now available in langchain: https://python.langchain.com/docs/modules/data_connection/text_embedding/caching_embeddings
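
For anyone landing here later, usage looks roughly like this per the linked docs at the time (import paths and class names may have moved in later releases, and OpenAIEmbeddings needs an API key to actually run):

```python
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.storage import LocalFileStore

underlying = OpenAIEmbeddings()
store = LocalFileStore("./embedding_cache/")

# Namespacing by model name avoids collisions between embedding models.
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying, store, namespace=underlying.model
)

# Repeated calls for the same texts are served from the local cache.
vectors = cached_embedder.embed_documents(["hello world", "goodbye world"])
```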

eyurtsev avatar Sep 01 '23 20:09 eyurtsev