text-generation-inference
Falcon 40B slow inference
Currently, I am running Falcon 40B quantized with bitsandbytes on 4x NVIDIA T4 GPUs, all on the same machine. I am getting a time_per_token of around 190 ms during inference. Below is my run command:
docker run --gpus all --shm-size 4g -p 8080:80 --name "falcon40b" --log-driver=local --log-opt max-size=10m --log-opt max-file=3 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8 --model-id tiiuae/falcon-40b-instruct --num-shard 4 --quantize "bitsandbytes"
Is there any optimization I can do to improve inference speed?
Wait for this to land: https://github.com/huggingface/text-generation-inference/pull/438 so you can use a kernel with better latency (GPTQ).
Hi @Narsil, this is really exciting! Do you have any early numbers to share on how much faster the GPTQ kernels typically are compared to bitsandbytes?
GPTQ is about as fast as the non-quantized version. I never ran bitsandbytes myself, so I have no clue, but IIRC it is multiple times slower (~4x maybe?).
@jiyuanq This article mentions it was tested to be ~20% slower for the BLOOM model; not sure about other models. @Narsil once the GPTQ change gets merged, I am assuming I only need to replace --quantize "bitsandbytes" with --quantize "gptq". Correct? Or do I also need to replace the docker image?
> I am assuming I only need to replace --quantize "bitsandbytes" with --quantize "gptq". Correct? Or do I also need to replace the docker image?
Well, you would need the newest Docker image to actually get support :)
And then you need to replace both the model and the flag. GPTQ quantization cannot be done while loading the model, the way bitsandbytes does it.
See https://huggingface.co/huggingface/falcon-40b-gptq for instance. We need to create them for the canonical Falcon models; not sure we can keep up with the community ones though :) but the PR includes the quantization script.
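For illustration, a minimal sketch of what the updated run command could look like once GPTQ support lands, assuming a pre-quantized repository such as the one linked above; the image tag is a placeholder for whichever release actually ships the GPTQ kernels:

```shell
# Sketch only: the image tag is a placeholder, use a release that includes the GPTQ kernels.
# The model repository must already contain GPTQ-quantized weights.
docker run --gpus all --shm-size 4g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id huggingface/falcon-40b-gptq \
    --num-shard 4 \
    --quantize gptq
```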
I see. Previously I tried quantization on falcon-7b, and got 58 ms per token with bitsandbytes, while without quantization it was 31 ms per token. If GPTQ can be as fast as non-quantized versions, it's going to be almost a 2x speed-up with half the memory footprint compared to bitsandbytes. A huge win indeed!
These numbers are, however, not competitive with OpenAI's offerings, unfortunately. I'm using TGI on SageMaker and it takes ~40 ms to generate a token on a non-quantized model; overall, generating 1000 tokens takes about a minute (an eternity :/).
@mspronesti SageMaker does not support token streaming. With streaming, you get a much better UX. With Hugging Face Inference Endpoints, you can use streaming and close the gap with OpenAI.
@OlivierDehaene Token streaming would only improve the user experience, wouldn't it? I mean, if the overall generation takes 1 minute, it will still take 1 minute even if I see the tokens appearing "live", right?
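Streaming does not shrink the total generation time, but the first tokens show up almost immediately instead of after the full minute, which is what closes most of the perceived gap. As a rough sketch, TGI's streaming endpoint can be exercised directly with curl, assuming the same 8080:80 port mapping as the run command above:

```shell
# Sketch: stream tokens as server-sent events from a locally running TGI container.
# -N disables curl's buffering so tokens print as they arrive.
curl -N http://127.0.0.1:8080/generate_stream \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'
```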
Here are my tests, in case people are wondering about the speed of GPTQ vs. bitsandbytes, performed on 4x NVIDIA A100 80GB GPUs:
meta-llama/Llama-2-13b-chat-hf
Total inference time: 2.22 s
Number of tokens generated: 82
Time per token: 27 ms/token
Tokens per second: 36.95 token/s
meta-llama/Llama-2-70b-chat-hf
Total inference time: 4.42 s
Number of tokens generated: 88
Time per token: 50 ms/token
Tokens per second: 19.92 token/s
meta-llama/Llama-2-70b-chat-hf (bitsandbytes-fp4)
Total inference time: 7.58 s
Number of tokens generated: 89
Time per token: 85 ms/token
Tokens per second: 11.74 token/s
meta-llama/Llama-2-70b-chat-hf (bitsandbytes-nf4)
Total inference time: 7.87 s
Number of tokens generated: 81
Time per token: 97 ms/token
Tokens per second: 10.29 token/s
TheBloke/Llama-2-70B-chat-GPTQ
Total inference time: 7.69 s
Number of tokens generated: 83
Time per token: 93 ms/token
Tokens per second: 10.80 token/s
So GPTQ comes out at roughly half the speed of the unquantized Llama-2-70B here.
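For anyone wanting to reproduce per-token numbers like these, here is a rough sketch of one way to measure them against a running TGI container; the prompt and max_new_tokens are arbitrary, and the request sets details so the server reports how many tokens it generated:

```shell
# Rough sketch: time a single /generate request and derive ms/token and tokens/s.
# Requires jq and bc; assumes TGI is reachable on port 8080 as in the run command above.
START=$(date +%s.%N)
RESP=$(curl -s http://127.0.0.1:8080/generate \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Explain quantization in one paragraph.", "parameters": {"max_new_tokens": 100, "details": true}}')
END=$(date +%s.%N)
ELAPSED=$(echo "$END - $START" | bc)
TOKENS=$(echo "$RESP" | jq '.details.generated_tokens')
echo "ms per token:   $(echo "scale=1; $ELAPSED * 1000 / $TOKENS" | bc)"
echo "tokens per sec: $(echo "scale=1; $TOKENS / $ELAPSED" | bc)"
```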
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.