litellm
[Feature]: reuse vertex_ai client
The Feature
Vertex AI seems to have per-request overhead (probably auth related?), so the client needs to be reused for faster responses.
Below is an experiment comparing a fresh client on every call (as litellm does) with reusing the model (and its client internally) in langchain.
Note that the time is higher in litellm because the input is not exactly the same (a difference in prompt-role translation, probably?), but the mean and standard deviation are consistent.
from langchain_community.chat_models.vertexai import ChatVertexAI
from litellm import completion
with litellm
%%timeit -n 1 -r 10
completion(model="gemini-pro", messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}])
4.64 s ± 1.05 s per loop (mean ± std. dev. of 10 runs, 1 loop each)
with langchain (fresh client)
%%timeit -n 1 -r 10
ChatVertexAI(model_name="gemini-pro").invoke("write code for saying hi from LiteLLM")
2.44 s ± 102 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
with langchain (reuse model)
model = ChatVertexAI(model_name="gemini-pro")
%%timeit -n 1 -r 1
model.invoke("write code for saying hi from LiteLLM")
2.46 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -n 1 -r 10
model.invoke("write code for saying hi from LiteLLM")
1.41 s ± 311 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
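For reference, here is a minimal sketch of what "reusing the client" means at the SDK level. This assumes the google-cloud-aiplatform SDK (vertexai.generative_models); the project id and location are placeholders, and this is not litellm code.
import vertexai
from vertexai.generative_models import GenerativeModel

# One-time setup: credentials and project/location are resolved once here.
vertexai.init(project="my-gcp-project", location="us-central1")  # placeholder values
model = GenerativeModel("gemini-pro")

# Subsequent calls reuse the underlying client instead of re-creating it per request.
for _ in range(3):
    response = model.generate_content("write code for saying hi from LiteLLM")
    print(response.text)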
Motivation, pitch
Vertex AI seems to have per-request overhead (probably auth related?), so the client needs to be reused for faster responses.
Twitter / LinkedIn details
https://www.linkedin.com/in/sufiyanadhikari/
Hey @dumbPy, it's weird you were seeing that time diff for the client. Translation shouldn't be adding 2s.
I'll investigate this on our end as well. And yes - I do agree - we can definitely reuse the vertex ai client.
@krrishdholakia The outputs for the same input are different. In litellm I am consistently getting longer output, hence the higher time. That might be because of different prompt translation and temperature, and can be looked at separately.
The only thing I am pointing out here is the difference between fresh and reused client time in langchain: fresh is 2.44 s ± 102 ms, while reused drops to 1.41 s ± 311 ms.
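One way to make the two timings more comparable would be to pin the generation parameters so output length stops dominating the latency. This is only a sketch; the parameter names are the standard litellm / ChatVertexAI ones, and the values are arbitrary examples.
from langchain_community.chat_models.vertexai import ChatVertexAI
from litellm import completion

# Fix temperature and cap the output so both paths generate similar-length replies.
completion(
    model="gemini-pro",
    messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}],
    temperature=0.0,
    max_tokens=256,
)

ChatVertexAI(
    model_name="gemini-pro",
    temperature=0.0,
    max_output_tokens=256,
).invoke("write code for saying hi from LiteLLM")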
Any progress on this issue? Do you have an architecture in mind to perform client caching? Would be happy to help once the architecture is decided.
hey @arnaud-secondlayer, the place to add this would be in set_client in router, where we already do this for the openai / azure clients - https://github.com/BerriAI/litellm/blob/5edb703d781a9a7a2d9ba98205669eb9d95a1680/litellm/router.py#L1740
Happy to do a quick call this week to talk through this, if that helps - https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat
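For anyone picking this up, the idea would be roughly a cache keyed by model, project, and location so repeated calls reuse one client, similar to what set_client already does for the openai / azure clients. The sketch below is only an illustration, not the actual litellm implementation; the helper name and cache structure are hypothetical.
import vertexai
from vertexai.generative_models import GenerativeModel

# Hypothetical module-level cache: one GenerativeModel per (model, project, location).
_vertex_client_cache: dict = {}

def get_vertex_model(model_name: str, project: str, location: str) -> GenerativeModel:
    # Create the model (and its underlying client) once, then reuse it on later calls.
    key = (model_name, project, location)
    if key not in _vertex_client_cache:
        vertexai.init(project=project, location=location)
        _vertex_client_cache[key] = GenerativeModel(model_name)
    return _vertex_client_cache[key]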
This is fixed in 1.40.2 - we cache Vertex AI clients @arnaud-secondlayer @dumbPy