cog-vicuna-13b

High latency on the first inference call

Open • rlancemartin opened this issue May 29 '23 • 3 comments

We are using the Replicate integration with LangChain:

from langchain.llms import Replicate

llm = Replicate(model="replicate/vicuna-13b:e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e",
                input={"temperature": 0.75, "max_length": 3000, "top_p": 0.25})

We are benchmarking question-answering latency using the LangChain auto-evaluator app: https://autoevaluator.langchain.com/playground

I run several inference calls and measure the latency of each (a timing sketch follows the numbers):

  • Call 1: 195.8 sec
  • Call 2: 7.7 sec
  • Call 3: 11.7 sec
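
For context, this is roughly how each call is timed; a minimal sketch, assuming the llm object from the snippet above and a stand-in prompt (the real questions come from the auto-evaluator app):

import time

# Time each blocking call through the LangChain Replicate wrapper defined above.
# The prompt here is a placeholder; the benchmark uses the auto-evaluator's questions.
prompt = "What is the capital of France?"

for i in range(3):
    start = time.perf_counter()
    llm(prompt)  # blocks until the Replicate prediction completes
    print(f"Call {i + 1}: {time.perf_counter() - start:.1f} sec")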

We see very high inference latency (e.g., 195 sec) for the initial call.

But subsequent calls are much faster (< 10 sec).

This is consistent across runs.

For example, another run today:

  • Call 1: 241.556 sec
  • Call 2: 5.951 sec
  • Call 3: 11.295 sec

With additional logging, I confirmed that the latency is indeed coming from the endpoint call; a sketch of timing the endpoint directly follows.
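
Here is a minimal sketch of how the endpoint can be timed while bypassing LangChain entirely; it assumes the replicate Python client (with REPLICATE_API_TOKEN set) and its predictions API, and uses a stand-in prompt:

import time
import replicate

# Create a prediction against the same model version used above and wait for it,
# so the wall-clock time reflects only the Replicate endpoint.
model = replicate.models.get("replicate/vicuna-13b")
version = model.versions.get("e6d469c2b11008bb0e446c3e9629232f9674581224536851272c54871f84076e")

start = time.perf_counter()
prediction = replicate.predictions.create(
    version=version,
    input={"prompt": "What is the capital of France?",
           "temperature": 0.75, "max_length": 3000, "top_p": 0.25},
)
prediction.wait()  # poll until the prediction reaches a terminal status
elapsed = time.perf_counter() - start

print("prediction id:", prediction.id)
print("wall-clock:", round(elapsed, 1), "sec")
print("metrics:", prediction.metrics)  # e.g. predict_time reported by Replicate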

Why is this?

It hurts the latency assessment of Vicuna-13b relative to other models:

[screenshot: latency comparison of vicuna-13b vs. other models]

rlancemartin commented May 29 '23 23:05

@joehoover any ideas on what may be happening?

rlancemartin commented May 31 '23 03:05

Wow, 195.8 sec is massive! @joehoover could this be a cold start problem? cc @bfirsh @mattt for visibility.

dankolesnikov commented May 31 '23 14:05

Hey @dankolesnikov and @rlancemartin, sorry for the delay! @dankolesnikov, I was thinking the same thing; however, I just checked and we have the model set to always on.

@rlancemartin, have you noticed any patterns that might be consistent with the delay being caused by cold starts? E.g., any sense of how long you need to wait for a request to be an "initial request" instead of a "subsequent request"?
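
One rough way to probe that empirically; a sketch, assuming the llm object from the original snippet (the idle intervals are arbitrary):

import time

# Hypothetical cold-start probe: idle for progressively longer between calls and
# see at what gap the latency jumps back toward the "initial request" range.
prompt = "What is the capital of France?"

for idle_minutes in (0, 2, 5, 10, 20):
    time.sleep(idle_minutes * 60)
    start = time.perf_counter()
    llm(prompt)
    print(f"after {idle_minutes} min idle: {time.perf_counter() - start:.1f} sec")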

Also, if you could share the model version ID and the prediction ID for a slow response, I'll try to identify a root cause.
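
If it helps, one way to dig those out; a sketch assuming the replicate Python client's predictions listing (exact field availability may vary):

import replicate

# List recent predictions on the account and note the ID / version of any
# conspicuously slow one (compare created_at vs. completed_at).
for prediction in replicate.predictions.list():
    print(prediction.id, prediction.version, prediction.status,
          prediction.created_at, prediction.completed_at)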

joehoover commented Jun 02 '23 14:06