text-generation-inference
Falcon 40B slow inference
Currently, I am running Falcon 40B quantized with bitsandbytes on 4x NVIDIA T4 GPUs, all on the same machine. I am getting a time_per_token of around 190 ms during inference. Below is my run command:
docker run --gpus all --shm-size 4g -p 8080:80 --name "falcon40b" --log-driver=local --log-opt max-size=10m --log-opt max-file=3 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8 --model-id tiiuae/falcon-40b-instruct --num-shard 4 --quantize "bitsandbytes"
Is there any optimization I can do to improve inference speed?
Wait for this to land: https://github.com/huggingface/text-generation-inference/pull/438 so you can use a kernel with better latency (GPTQ).
Hi @Narsil, this is really exciting! Do you have any early numbers to share on how much faster the GPTQ kernels typically are compared to bitsandbytes?
GPTQ is about as fast as the non-quantized version. I never ran bitsandbytes myself, so I have no clue, but IIRC it is multiple times slower (~4x maybe?).
@jiyuanq This article mentions it was tested to be ~20% slower for the BLOOM model; not sure about other models. @Narsil once the GPTQ change gets merged, I am assuming I only need to replace --quantize "bitsandbytes" with --quantize "gptq". Correct? Or do I also need to replace the docker image?
> I am assuming I only need to replace --quantize "bitsandbytes" with --quantize "gptq". Correct? Or do I also need to replace the docker image?
Well, you would need the newest Docker image to actually get support :)
And then you need to replace both the model and the flag. GPTQ quantization cannot be done while loading the model, the way bitsandbytes does it.
See https://huggingface.co/huggingface/falcon-40b-gptq for instance. We need to create them for the canonical Falcon models; not sure we can keep up with the community ones though :) but the PR includes the quantization script.
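For illustration, a minimal sketch of what the updated run command could look like once GPTQ support lands, assuming a pre-quantized repository such as the one linked above; the image tag is a placeholder for whichever release actually ships the GPTQ kernels:

```shell
# Sketch only: the image tag is a placeholder, use a release that includes the GPTQ kernels.
# The model repository must already contain GPTQ-quantized weights.
docker run --gpus all --shm-size 4g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id huggingface/falcon-40b-gptq \
    --num-shard 4 \
    --quantize gptq
```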
I see. Previously I tried quantization on falcon-7b, and got 58 ms per token with bitsandbytes, while without quantization it was 31 ms per token. If GPTQ can be as fast as non-quantized versions, it's going to be almost a 2x speed-up with half the memory footprint compared to bitsandbytes. A huge win indeed!
These numbers are, however, not competitive with OpenAI's offerings, unfortunately. I'm using TGI on SageMaker and it takes ~40 ms to generate a token on a non-quantized model; overall, generating 1000 tokens takes about a minute (an eternity :/).
@mspronesti SageMaker does not support token streaming. With streaming, you get a much better UX. With Hugging Face Inference Endpoints, you can use streaming and close the gap with OpenAI.
@OlivierDehaene Token streaming would only improve the user experience, wouldn't it? I mean, if the overall generation takes 1 minute, it will still take 1 minute even if I see the tokens appearing "live", right?
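Streaming does not shrink the total generation time, but the first tokens show up almost immediately instead of after the full minute, which is what closes most of the perceived gap. As a rough sketch, TGI's streaming endpoint can be exercised directly with curl, assuming the same 8080:80 port mapping as the run command above:

```shell
# Sketch: stream tokens as server-sent events from a locally running TGI container.
# -N disables curl's buffering so tokens print as they arrive.
curl -N http://127.0.0.1:8080/generate_stream \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'
```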
Here are my tests, in case people are wondering about the speed of GPTQ vs. bitsandbytes, performed on 4x NVIDIA A100 80GB GPUs:
meta-llama/Llama-2-13b-chat-hf
Total inference time: 2.22 s
Number of tokens generated: 82
Time per token: 27 ms/token
Tokens per second: 36.95 token/s
meta-llama/Llama-2-70b-chat-hf
Total inference time: 4.42 s
Number of tokens generated: 88
Time per token: 50 ms/token
Tokens per second: 19.92 token/s
meta-llama/Llama-2-70b-chat-hf (bitsandbytes-fp4)
Total inference time: 7.58 s
Number of tokens generated: 89
Time per token: 85 ms/token
Tokens per second: 11.74 token/s
meta-llama/Llama-2-70b-chat-hf (bitsandbytes-nf4)
Total inference time: 7.87 s
Number of tokens generated: 81
Time per token: 97 ms/token
Tokens per second: 10.29 token/s
TheBloke/Llama-2-70B-chat-GPTQ
Total inference time: 7.69 s
Number of tokens generated: 83
Time per token: 93 ms/token
Tokens per second: 10.80 token/s
So GPTQ comes out at roughly half the speed of the unquantized Llama-2-70B here.
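For anyone wanting to reproduce per-token numbers like these, here is a rough sketch of one way to measure them against a running TGI container; the prompt and max_new_tokens are arbitrary, and the request sets details so the server reports how many tokens it generated:

```shell
# Rough sketch: time a single /generate request and derive ms/token and tokens/s.
# Requires jq and bc; assumes TGI is reachable on port 8080 as in the run command above.
START=$(date +%s.%N)
RESP=$(curl -s http://127.0.0.1:8080/generate \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Explain quantization in one paragraph.", "parameters": {"max_new_tokens": 100, "details": true}}')
END=$(date +%s.%N)
ELAPSED=$(echo "$END - $START" | bc)
TOKENS=$(echo "$RESP" | jq '.details.generated_tokens')
echo "ms per token:   $(echo "scale=1; $ELAPSED * 1000 / $TOKENS" | bc)"
echo "tokens per sec: $(echo "scale=1; $TOKENS / $ELAPSED" | bc)"
```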
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.