Inference Time on your site?
Hi
I am writing to inquire about the inference time for your large language model, Dolly. I have been loading the Dolly model from Hugging Face on my AWS machine and have noticed that inference is quite slow. As a result, I would like to know the inference time for Dolly when it is used on your site.
Could you please share some information on this? Do you have any quantitative results posted?
Here is the information on my side:
Machine: AWS g4dn.12xlarge instance
Model: databricks/dolly-v1-6b on Hugging Face
Number of tokens: 900 on average
Inference time: 100-200 seconds per utterance
What site are you referring to here? You should use the 'v2' models. With no particular tuning, on an A10 for example, you might expect 3-5 seconds per response for the 3B model, 10-15 for the 7B model, and so on. I'm not sure what you're observing. Per token that's probably under 100 ms. It really depends on generation settings, hardware, and model size. You are on a T4 GPU, which won't be great for this.
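If you want to compare on your own machine, here is a minimal timing sketch; the dolly-v2-3b checkpoint, prompt, and generation settings are only illustrative, and it assumes a single CUDA GPU with `accelerate` installed so that `device_map="auto"` works:

```python
# Minimal latency sketch: load a Dolly v2 checkpoint in fp16 and time one
# greedy generation, reporting total seconds and ms per generated token.
# Checkpoint, prompt, and settings are illustrative, not a reference setup.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain the difference between nuclear fission and fusion."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{elapsed:.1f} s total, {1000 * elapsed / new_tokens:.0f} ms per generated token")
```

The numbers from something like this will shift with `max_new_tokens`, sampling settings, and whether the weights are loaded in fp32 versus fp16/bf16.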
Thanks for your reply. How many GPUs do you use?
How many GPUs for what?
Sorry for not expressing myself clearly.
How many GPUs do you use to reach roughly 100 ms per token for inference? I'd like to understand the configuration (machine, model, and number of GPUs) you use to get the inference times you mention above.
One GPU, like an A10, for the 3B model, maybe; I haven't measured it closely. A more optimized deployment of the 12B model can hit more like 10 ms per token if you do it right. It really depends on many factors.
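Batch size is one of those factors: batching mostly raises aggregate throughput (tokens per second across requests) rather than single-request latency, and it is not what I'd call an optimized deployment on its own. As a rough sketch with the plain transformers API, again using the 3B checkpoint purely as an example:

```python
# Rough batched-generation sketch: processing several prompts per forward
# pass amortizes per-step overhead and raises aggregate tokens/sec.
# Checkpoint and prompts are illustrative; this is not a full serving stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Decoder-only models should be left-padded for batched generation.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = [
    "Explain nuclear fission in one sentence.",
    "Explain nuclear fusion in one sentence.",
]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **batch,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```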