Inference Time on your site?
Hi
I am writing to inquire about the inference time for your large language model, Dolly. I have been loading the Dolly model from Hugging Face on my AWS machine and have noticed that inference is quite slow. As a result, I would like to know the inference time for Dolly when it is used on your site.
Could you please share some information on this? Do you have any quantitative results posted?
Here is the information on my side:
Machine: AWS g4dn.12xlarge instance
Model: databricks/dolly-v1-6b on Hugging Face
Number of tokens: 900 on average
Inference time: 100-200 seconds per utterance
What site are you referring to here? You should use the 'v2' models. With no particular tuning, on an A10 for example, you might expect 3-5 seconds per response for the 3B model, 10-15 for the 7B model, and so on. I'm not sure what you're observing. Per token that's probably under 100 ms. It really depends on generation settings, hardware, and model size. You are on a T4 GPU, which won't be great for this.
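If you want to compare on your own machine, here is a minimal timing sketch; the dolly-v2-3b checkpoint, prompt, and generation settings are only illustrative, and it assumes a single CUDA GPU with `accelerate` installed so that `device_map="auto"` works:

```python
# Minimal latency sketch: load a Dolly v2 checkpoint in fp16 and time one
# greedy generation, reporting total seconds and ms per generated token.
# Checkpoint, prompt, and settings are illustrative, not a reference setup.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain the difference between nuclear fission and fusion."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{elapsed:.1f} s total, {1000 * elapsed / new_tokens:.0f} ms per generated token")
```

The numbers from something like this will shift with `max_new_tokens`, sampling settings, and whether the weights are loaded in fp32 versus fp16/bf16.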
Thanks for your reply. How many GPUs do you use?
How many GPUs for what?
Sorry for not expressing myself clearly.
How many GPUs do you use to reach roughly 100 ms per token for inference? I'd like to understand the configuration (machine, model, and number of GPUs) you use to get the inference times you mention above.
One GPU, like an A10, for the 3B model, maybe; I haven't measured it closely. A more optimized deployment of the 12B model can hit more like 10 ms per token if you do it right. It really depends on many factors.
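Batch size is one of those factors: batching mostly raises aggregate throughput (tokens per second across requests) rather than single-request latency, and it is not what I'd call an optimized deployment on its own. As a rough sketch with the plain transformers API, again using the 3B checkpoint purely as an example:

```python
# Rough batched-generation sketch: processing several prompts per forward
# pass amortizes per-step overhead and raises aggregate tokens/sec.
# Checkpoint and prompts are illustrative; this is not a full serving stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Decoder-only models should be left-padded for batched generation.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = [
    "Explain nuclear fission in one sentence.",
    "Explain nuclear fusion in one sentence.",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **batch,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```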