
[Feature]: caching individual texts from batch call

Open · thiswillbeyourgithub opened this issue Jan 07 '24 · 10 comments

The Feature

When making an API call to get embeddings for a list of 100 strings, I expect the cache to remember each string individually instead of only the whole batch call.

I don't think that's the case right now, but I could be wrong; if so, this should be made explicit in the documentation.
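
To make the request concrete, here is a minimal sketch of the two caching granularities. The dict cache and fake_embed function are purely illustrative, not litellm internals.

# Illustrative only: a dict cache and a stand-in embedder, not litellm internals.
cache = {}

def fake_embed(texts):
    return [[float(len(t))] for t in texts]  # placeholder for a real embedding call

texts = ["alpha", "beta", "gamma"]

# batch-level caching: one key for the whole list, so changing any element is a full miss
cache[tuple(texts)] = fake_embed(texts)

# per-item caching (what this issue asks for): unchanged strings still hit the cache
for t in ["alpha", "delta", "gamma"]:
    if t not in cache:
        cache[t] = fake_embed([t])[0]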

Motivation, pitch

Say I want to get embeddings for a long list of documents. If I enable caching, I expect the cache to cover 99% of my strings when I have changed only 1% of them, even if I send a single call with the whole list as the argument. The list itself would be different, but only 1% of its elements would differ and need re-embedding.


thiswillbeyourgithub avatar Jan 07 '24 17:01 thiswillbeyourgithub

Working on this now - we'll need to make sure we don't add a ton of latency for large embedding calls

  • can't do a for loop to check, it'll take too long
  • will need to run checks for each individual item in list in parallel

This will also require a change to how we currently cache embedding calls -> moving from caching on the whole input= value to caching the individual items in the list, if a list is passed in
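
A minimal sketch of that parallel per-item check, assuming a hypothetical async cache object with async_get/async_set methods (illustrative names, not litellm's actual internals):

import asyncio

async def get_embeddings_with_item_cache(cache, embed_fn, texts):
    # check every item concurrently instead of a sequential for-loop
    hits = await asyncio.gather(*(cache.async_get(t) for t in texts))

    # batch-embed only the misses
    misses = [t for t, h in zip(texts, hits) if h is None]
    fresh = await embed_fn(misses) if misses else []

    # write the new embeddings back, one cache entry per item, again in parallel
    await asyncio.gather(*(cache.async_set(t, v) for t, v in zip(misses, fresh)))

    # stitch cached and fresh results back together in the original order
    fresh_map = dict(zip(misses, fresh))
    return [h if h is not None else fresh_map[t] for t, h in zip(texts, hits)]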

krrishdholakia avatar Jan 11 '24 08:01 krrishdholakia

Before release, we should do some load testing on this change, to understand how this impacts latency for consecutive unique calls

krrishdholakia avatar Jan 11 '24 09:01 krrishdholakia

Initial PR made - https://github.com/BerriAI/litellm/pull/1417

cc: @thiswillbeyourgithub

This currently only supports async embeddings (as we're running the check for all items in parallel via asyncio.gather)
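
For reference, a rough sketch of how the async path could be exercised; the Cache class and the caching=True flag are taken from litellm's general caching docs and are assumptions here, not something this thread confirms.

import asyncio
import litellm
from litellm.caching import Cache

litellm.cache = Cache()  # defaults to an in-memory cache

async def main():
    # requires OPENAI_API_KEY in the environment for this model
    texts = ["alpha", "beta", "gamma"]
    await litellm.aembedding(model="text-embedding-ada-002", input=texts, caching=True)

    # only "delta" should need a real API call; the other items should be per-item cache hits
    await litellm.aembedding(
        model="text-embedding-ada-002",
        input=["alpha", "delta", "gamma"],
        caching=True,
    )

asyncio.run(main())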

will that work for your use-case? @thiswillbeyourgithub

krrishdholakia avatar Jan 11 '24 11:01 krrishdholakia

Hi, thanks for taking the time.

It seems fine.

If that helps, I did something akin to that on my own a few days ago:

import litellm
from joblib import Memory
import numpy as np

cache_path = "./embedding_cache"  # folder used by joblib.Memory

def embedder(text_list, result):
    """compute the embeddings of a list of texts
    if result is not None, it is the list of embeddings and is returned right
    away. This was done to allow caching individual embeddings while still
    making one batch call to the embedder.
    """
    assert isinstance(text_list, list)
    if result is not None:
        assert isinstance(result, list)
        assert len(text_list) == len(result)
        assert all(isinstance(a, np.ndarray) for a in result)
        return result

    vec = litellm.embedding(
            # mistral-medium is a chat model; mistral-embed is Mistral's embedding model
            model="mistral/mistral-embed",
            input=text_list,
            )
    return [np.array(d["embedding"]).reshape(1, -1) for d in vec.data]


def embedder_wrapper(list_text):
    mem = Memory(cache_path, verbose=0)
    cached_embedder = mem.cache(embedder, ignore=["result"])
    uncached_texts = [t for t in list_text if not cached_embedder.check_call_in_cache([t], None)]

    if not uncached_texts:
        print("Everything already in cache")
        return [cached_embedder([t], None)[0] for t in list_text]

    if len(uncached_texts) > 1:
        present = len(list_text) - len(uncached_texts)
        print(f"Embeddings present in cache: {present}/{len(list_text)}")

    # get the embeddings for the uncached values
    results = cached_embedder(uncached_texts, None)

    # manually recache the values for each individual memory
    [cached_embedder([t], [r]) for t, r in zip(uncached_texts, results)]

    # combine cached and uncached results in the right order
    to_return = []
    it_results = iter(results)
    cnt = 0
    for i in range(len(list_text)):
        if list_text[i] in uncached_texts:
            to_return.append(next(it_results)[0])
            cnt += 1
        else:
            to_return.append(cached_embedder([list_text[i]], None)[0])
    # make sure the list was emptied
    assert cnt == len(results)
    return to_return


embedder_wrapper(["some texts", "to embed"])

It returns only the vectors (so metadata isn't handled), but roughly it just uses joblib.Memory to cache the embedding function, with the cache set to ignore a "result" argument. embedder_wrapper then loads each item of the list that is already in the cache, calls the embedder on the rest, manually re-caches each new item by calling the embedder again while supplying its embedding through the "result" argument, and finally reassembles the cached and live results into a single list to return.

Also, joblib.Memory does not handle async functions, but someone shows a workaround on their GitHub: https://github.com/joblib/joblib/issues/889#issuecomment-1840865997
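
For completeness, one simple pattern (not necessarily the one from the linked comment, just an illustration) is to keep the cached function synchronous and call it from async code via a worker thread:

import asyncio
from joblib import Memory

memory = Memory("./async_cache", verbose=0)

@memory.cache
def compute_sync(x):
    # expensive synchronous work; results are cached on disk by joblib
    return x * x

async def compute(x):
    # run the cached sync function in a worker thread so the event loop isn't blocked
    return await asyncio.to_thread(compute_sync, x)

async def main():
    print(await compute(3))  # computed
    print(await compute(3))  # served from the joblib cache

asyncio.run(main())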

Edit: I don't guarantee this code. I haven't tested it much and I had to remove many irrelevant lines before pasting it here.

thiswillbeyourgithub avatar Jan 11 '24 11:01 thiswillbeyourgithub

@thiswillbeyourgithub this is interesting.

Is joblib basically a way of doing fast in-memory caching?

krrishdholakia avatar Jan 11 '24 17:01 krrishdholakia

Joblib is several things. It contains a wrapper around queues and threading for multiprocessing/multithreading, but it also has Memory, which enables easy caching of functions, methods, etc. in a local folder. I use it frequently and it's a good package.
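
For context, a minimal joblib.Memory example showing the disk-backed caching described above (the path and function names are just illustrative):

from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)  # results are stored in this folder

@memory.cache
def slow_square(x):
    print("computing...")  # only printed on a cache miss
    return x * x

slow_square(4)  # computed and written to disk
slow_square(4)  # served from the cache, no print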

thiswillbeyourgithub avatar Jan 11 '24 17:01 thiswillbeyourgithub

oh - @thiswillbeyourgithub how're you using litellm today?

krrishdholakia avatar Jan 11 '24 17:01 krrishdholakia

For many things. I use langchain and from-scratch implementations for a variety of LLM stuff. I found litellm, which made it exceedingly easy to set up various APIs: it let me use OpenRouter, Replicate, OpenAI and Mistral quickly.

I do regret the lack of whisper support though.

thiswillbeyourgithub avatar Jan 11 '24 17:01 thiswillbeyourgithub

oh - why do you need litellm to support whisper? @thiswillbeyourgithub

krrishdholakia avatar Jan 11 '24 18:01 krrishdholakia

  1. It would allow me to use only litellm, instead of only being able to replace 90% of my LLM calls with litellm and keeping openai/replicate code around for the rest.
  2. It would allow a lot of flexibility, e.g. swapping Whisper from OpenAI to various Replicate models or local ones.

thiswillbeyourgithub avatar Jan 11 '24 18:01 thiswillbeyourgithub

This is now live

krrishdholakia avatar Feb 05 '24 22:02 krrishdholakia

Great to hear!

Can you confirm the following:

  1. The caching works in async and sync modes
  2. The caching works for batch completion and not just embeddings
  3. This feature is mentioned in the documentation

I think it would really be a shame not to mention it (I didn't see it when I looked recently), as it's a definite cost saver and reduces boilerplate by a substantial margin for those who care :)

Thanks a lot for working on this!

thiswillbeyourgithub avatar Feb 06 '24 10:02 thiswillbeyourgithub