litellm
[Feature]: reuse vertex_ai client
The Feature
Vertex AI seems to have per-request overhead (probably auth related?), so the client needs to be reused for faster responses.
Below is an experiment comparing a fresh client on every call (as litellm does) with reusing the model (and its client internally) in langchain.
Note that the time is higher in litellm because the input is not exactly the same (a difference in prompt-role translation, probably?), but the mean and standard deviation are consistent.
from langchain_community.chat_models.vertexai import ChatVertexAI
from litellm import completion
with litellm
%%timeit -n 1 -r 10
completion(model="gemini-pro", messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}])
4.64 s ± 1.05 s per loop (mean ± std. dev. of 10 runs, 1 loop each)
with langchain (fresh client)
%%timeit -n 1 -r 10
ChatVertexAI(model_name="gemini-pro").invoke("write code for saying hi from LiteLLM")
2.44 s ± 102 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
with langchain (reuse model)
model = ChatVertexAI(model_name="gemini-pro")
%%timeit -n 1 -r 1
model.invoke("write code for saying hi from LiteLLM")
2.46 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -n 1 -r 10
model.invoke("write code for saying hi from LiteLLM")
1.41 s ± 311 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
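For reference, here is a minimal sketch of what "reusing the client" means at the SDK level. This assumes the google-cloud-aiplatform SDK (vertexai.generative_models); the project id and location are placeholders, and this is not litellm code.
import vertexai
from vertexai.generative_models import GenerativeModel

# One-time setup: credentials and project/location are resolved once here.
vertexai.init(project="my-gcp-project", location="us-central1")  # placeholder values
model = GenerativeModel("gemini-pro")

# Subsequent calls reuse the underlying client instead of re-creating it per request.
for _ in range(3):
    response = model.generate_content("write code for saying hi from LiteLLM")
    print(response.text)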
Motivation, pitch
Vertex AI seems to have per-request overhead (probably auth related?), so the client needs to be reused for faster responses.
Twitter / LinkedIn details
https://www.linkedin.com/in/sufiyanadhikari/
Hey @dumbPy, it's weird you were seeing that time diff for the client. Translation shouldn't be adding 2s.
I'll investigate this on our end as well. And yes - I do agree - we can definitely reuse the vertex ai client.
@krrishdholakia The outputs for the same input are different. In litellm I am consistently getting longer output, hence the higher time. That might be because of different prompt translation and temperature, and can be looked at separately.
The only thing I am pointing out here is the difference between fresh and reused client time in langchain: fresh is 2.44 s ± 102 ms, while reused drops to 1.41 s ± 311 ms.
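One way to make the two timings more comparable would be to pin the generation parameters so output length stops dominating the latency. This is only a sketch; the parameter names are the standard litellm / ChatVertexAI ones, and the values are arbitrary examples.
from langchain_community.chat_models.vertexai import ChatVertexAI
from litellm import completion

# Fix temperature and cap the output so both paths generate similar-length replies.
completion(
    model="gemini-pro",
    messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}],
    temperature=0.0,
    max_tokens=256,
)

ChatVertexAI(
    model_name="gemini-pro",
    temperature=0.0,
    max_output_tokens=256,
).invoke("write code for saying hi from LiteLLM")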
Any progress on this issue? Do you have an architecture in mind to perform client caching? Would be happy to help once the architecture is decided.
hey @arnaud-secondlayer, the place to add this would be in set_client in router, where we already do this for the openai / azure clients - https://github.com/BerriAI/litellm/blob/5edb703d781a9a7a2d9ba98205669eb9d95a1680/litellm/router.py#L1740
Happy to do a quick call this week to talk through this, if that helps - https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat
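For anyone picking this up, the idea would be roughly a cache keyed by model, project, and location so repeated calls reuse one client, similar to what set_client already does for the openai / azure clients. The sketch below is only an illustration, not the actual litellm implementation; the helper name and cache structure are hypothetical.
import vertexai
from vertexai.generative_models import GenerativeModel

# Hypothetical module-level cache: one GenerativeModel per (model, project, location).
_vertex_client_cache: dict = {}

def get_vertex_model(model_name: str, project: str, location: str) -> GenerativeModel:
    # Create the model (and its underlying client) once, then reuse it on later calls.
    key = (model_name, project, location)
    if key not in _vertex_client_cache:
        vertexai.init(project=project, location=location)
        _vertex_client_cache[key] = GenerativeModel(model_name)
    return _vertex_client_cache[key]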
This is fixed in 1.40.2 - we cache Vertex AI clients @arnaud-secondlayer @dumbPy