api-inference-community

Model loading in the Inference API

Open clmnt opened this issue 2 years ago • 2 comments

Describe the bug

It gets stuck at "model loading".

Reproduction

Go to https://huggingface.co/nitrosocke/classic-anim-diffusion and submit a prompt for the first time.

https://www.loom.com/share/10fdb5920e0248cc8162e145f8957d77
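
The same behavior can be reproduced outside the browser. A minimal sketch against the public Inference API endpoint (`hf_xxx` is a placeholder token, and the 503 handling assumes the API's standard "model is loading" response with an `estimated_time` field):

```python
import time

import requests

API_URL = "https://api-inference.huggingface.co/models/nitrosocke/classic-anim-diffusion"
HEADERS = {"Authorization": "Bearer hf_xxx"}  # placeholder; use a real token

def query(prompt: str) -> bytes:
    while True:
        r = requests.post(API_URL, headers=HEADERS, json={"inputs": prompt})
        if r.status_code == 503:
            # Model is still loading; the body carries a rough time estimate.
            wait = r.json().get("estimated_time", 10)
            print(f"model loading, retrying in {wait:.0f}s")
            time.sleep(wait)
            continue
        r.raise_for_status()
        return r.content  # raw image bytes for a text-to-image model

with open("out.png", "wb") as f:
    f.write(query("classic disney style magical princess"))
```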

Logs

No response

System info

Chrome

clmnt avatar Nov 16 '22 23:11 clmnt

The first time, it says that the model is loading. When you refresh, it turns out the model is now loaded, so inference is fast this time. Moving to the community repo.

osanseviero avatar Nov 16 '22 23:11 osanseviero

There are multiple known issues at play here:

  1. Model loading is not really using correct information: api-inference doesn't know how to "guess" the model size properly, so the loading bar is not accurate. It will never be exact, but even a simple rule of thumb would make the loading bar bigger and more representative (see the sketch after this list).
  2. First loads are always much longer because the weights have to be downloaded first.
  3. Sometimes, depending on cluster conditions, creating the Docker container is slower than usual (it depends on how many GPUs are in use, how many nodes are available, etc.; creating a new node on demand is much slower than just launching a pod).
  4. Inference itself still takes 5-6s, which feels very "slow" to us humans. Using xformers and fast attention should help a bit (expected to go down to ~3s; sketched further below).
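
For 1, a minimal sketch of a size-based rule of thumb using huggingface_hub. This is only illustrative, not what api-inference actually does, and the weight-file suffix list is an assumption:

```python
from huggingface_hub import HfApi

# Assumption: these suffixes cover the files that actually get downloaded as weights.
WEIGHT_SUFFIXES = (".bin", ".safetensors", ".ckpt", ".h5", ".msgpack")

def estimated_model_bytes(repo_id: str) -> int:
    """Rough download size: sum of weight-file sizes listed in the repo."""
    info = HfApi().model_info(repo_id, files_metadata=True)
    return sum(
        s.size or 0
        for s in info.siblings
        if s.rfilename.endswith(WEIGHT_SUFFIXES)
    )

print(f"{estimated_model_bytes('nitrosocke/classic-anim-diffusion') / 1e9:.1f} GB")
```

A loading bar scaled against that total would at least grow with the model size instead of using a fixed guess.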

Here I'm thinking 1 and 4 are the things we can most effectively do something about. We're also working on adding tracing to the cluster so we have a better picture of 2 and 3.
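
For 4, enabling xformers in diffusers looks roughly like this (a sketch, assuming xformers is installed and a CUDA GPU is available; the prompt and filename are placeholders):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "nitrosocke/classic-anim-diffusion",
    torch_dtype=torch.float16,
).to("cuda")

# Swaps in xformers' memory-efficient attention kernels,
# the kind of change expected to bring ~5-6s latency toward ~3s.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("classic disney style magical princess").images[0]
image.save("out.png")
```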

@NouamaneTazi

Narsil avatar Nov 17 '22 10:11 Narsil