GPTQ-for-LLaMa
4-bit llama gets progressively slower with each text generation
Generation takes more time with each message, as if there is an accumulating overhead.
For example, the second response was generated 11x faster than the last response, even though they have the same number of tokens.
The issue occurs with both llama-7b and llama-13b.
Running llama with: python3.10 server.py --load-in-4bit --model llama-7b-hf --cai-chat --no-stream
Specs: GPU: RTX 3060 12GB, CPU: Intel i5-12400F, RAM: 64GB DDR4 3200MHz, OS: Linux
If the text is twice as long, the amount of computation required to generate one token is four times as large (time complexity O(n^2), where n is the sequence length).
> If the text is twice as long, the amount of computation required to generate one token is four times as large (time complexity O(n^2), where n is the sequence length).
It doesn't happen with llama 8-bit. Additionally, the second and the last responses have the same number of tokens.
> If the text is twice as long, the amount of computation required to generate one token is four times as large (time complexity O(n^2), where n is the sequence length).
So this would imply that generating each individual token gets slower. But what I am seeing is that only the time until the first token is generated increases, and each subsequent token is generated nearly as fast as with an empty context.
(You can also easily test this by comparing the time it takes to generate 30 tokens at 512 context, vs 300 tokens at 512 context.)
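As a rough illustration of that test (not from the thread; this sketch assumes a stock Hugging Face float16 checkpoint rather than this repo's 4-bit GPTQ loader, and the model name is only a placeholder), timing few vs. many new tokens at the same context length separates the one-off prompt-processing cost from the per-token decoding cost:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-7b-hf"  # placeholder; any causal LM shows the same shape of result
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

# Build a ~512-token prompt by repeating a phrase, then truncate to exactly 512 tokens.
prompt_ids = tok("All work and no play. " * 200, return_tensors="pt").input_ids[:, :512].cuda()

def timed_generate(max_new_tokens):
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - start

t_short = timed_generate(30)   # mostly prompt processing
t_long = timed_generate(300)   # same prompt cost, 270 extra decode steps

# If only the prompt pass is slow, (t_long - t_short) / 270 should roughly match
# the per-token speed seen with an empty context.
print(f"30 new tokens: {t_short:.2f}s, 300 new tokens: {t_long:.2f}s")
```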
If it helps, this is from a --cai-chat session. Between the 30.33s response and the 14.39s one is when "Clear history" was executed in the UI.
I subjectively experience the same progressive slowdown, but I have not looked into whether that is expected from how text-generation-webui works.
python server.py --cai-chat --load-in-4bit --model llama-13b-hf
Linux, i7, 96GB RAM, 12GB VRAM (Nvidia RTX A2000), Cuda 11.7
Output generated in 4.10 seconds (2.44 tokens/s, 10 tokens)
Output generated in 4.38 seconds (4.57 tokens/s, 20 tokens)
Output generated in 6.62 seconds (5.89 tokens/s, 39 tokens)
Output generated in 3.81 seconds (2.89 tokens/s, 11 tokens)
Output generated in 6.20 seconds (5.64 tokens/s, 35 tokens)
Output generated in 8.63 seconds (5.45 tokens/s, 47 tokens)
Output generated in 10.87 seconds (3.40 tokens/s, 37 tokens)
Output generated in 12.14 seconds (2.06 tokens/s, 25 tokens)
Output generated in 14.59 seconds (2.95 tokens/s, 43 tokens)
Output generated in 18.39 seconds (3.05 tokens/s, 56 tokens)
Output generated in 18.40 seconds (0.43 tokens/s, 8 tokens)
Output generated in 20.92 seconds (1.10 tokens/s, 23 tokens)
Output generated in 23.03 seconds (1.22 tokens/s, 28 tokens)
Output generated in 26.37 seconds (1.52 tokens/s, 40 tokens)
Output generated in 30.33 seconds (1.38 tokens/s, 42 tokens)
Output generated in 14.39 seconds (13.83 tokens/s, 199 tokens)
Output generated in 13.48 seconds (2.97 tokens/s, 40 tokens)
Output generated in 15.91 seconds (2.33 tokens/s, 37 tokens)
Output generated in 15.71 seconds (0.38 tokens/s, 6 tokens)
Output generated in 20.48 seconds (2.88 tokens/s, 59 tokens)
Output generated in 17.96 seconds (1.06 tokens/s, 19 tokens)
Output generated in 18.62 seconds (0.64 tokens/s, 12 tokens)
Output generated in 19.59 seconds (0.36 tokens/s, 7 tokens)
Output generated in 23.36 seconds (2.31 tokens/s, 54 tokens)
Output generated in 22.69 seconds (0.09 tokens/s, 2 tokens)
Output generated in 25.06 seconds (0.56 tokens/s, 14 tokens)
python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --load llama7b-4bit.pt --text "this is llama"
These are the inference speeds obtained by running this command.
It definitely shows low inference speed at the beginning, but this is normal (see https://github.com/pytorch/pytorch/issues/44269).
1.50 s for 5 tokens (0.30 s/token)
2.33 s for 20 tokens (0.1165 s/token)
3.95 s for 50 tokens (0.079 s/token)
6.71 s for 100 tokens (0.0671 s/token)
30.35 s for 500 tokens (0.0607 s/token)
63.79 s for 1000 tokens (0.06379 s/token)
130.39 s for 2000 tokens (0.06519 s/token)
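As a quick sanity check on those numbers (an added aside, not from the thread), they fit a fixed startup cost plus a roughly constant per-token cost; a least-squares line through total time vs. token count makes the warmup term visible:

```python
# Measurements quoted above: (tokens generated, total seconds).
data = [(5, 1.50), (20, 2.33), (50, 3.95), (100, 6.71),
        (500, 30.35), (1000, 63.79), (2000, 130.39)]

# Least-squares fit of t(n) = a + b*n: a ~ one-off warmup/startup, b ~ steady per-token cost.
n_mean = sum(n for n, _ in data) / len(data)
t_mean = sum(t for _, t in data) / len(data)
b = sum((n - n_mean) * (t - t_mean) for n, t in data) / sum((n - n_mean) ** 2 for n, t in data)
a = t_mean - b * n_mean

print(f"estimated startup ~{a:.2f} s, steady rate ~{b * 1000:.1f} ms/token")
for n, t in data:
    print(f"{n:5d} tokens: measured {t:7.2f} s, linear fit {a + b * n:7.2f} s")
```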
> If the text is twice as long, the amount of computation required to generate one token is four times as large (time complexity O(n^2), where n is the sequence length).
> So this would imply that generating each individual token gets slower. But what I am seeing is that only the time until the first token is generated increases, and each subsequent token is generated nearly as fast as with an empty context.
The O(n^2) refers to the total computation time for processing the sequence. So what you should see is that a 400-token prompt with 100 generated tokens takes just as long as a 100-token prompt with 400 generated tokens, and in either case 1000 total tokens should take twice as long. Basically, if you set --min_size=1000 --max_size=1000, you can use a prompt of any length (shorter than 1000 tokens) and the total time should be the same.
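One way to formalize that claim (an added note, counting attention work only and ignoring batching effects in the prompt pass): with a KV cache, the attention work for the token at position i is proportional to i, so the total work for n tokens is

\sum_{i=1}^{n} c\,i \;=\; \frac{c\,n(n+1)}{2} \;=\; O(n^2),

which depends only on the total token count n, not on how it is split between prompt and generated tokens.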
In my testing with 7B quantised to 4-bit, 1600 total tokens take 3.1x as long as 800 total, so that's reasonably close to the expected slowdown, modulo some constant warmup time for loading the model.
(Note that it's not the same if you were to generate 1 token, add it to a new prompt, generate 1 token, etc. The generator reuses the hidden state it already computed for all the previous tokens when it makes the next one, and you'd lose that if you restart the generation.)
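A minimal sketch of what that cache reuse looks like (assuming a standard Hugging Face causal LM API rather than this repo's quantized path; the model name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-7b-hf"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

input_ids = tok("The quick brown fox", return_tensors="pt").input_ids.cuda()

with torch.no_grad():
    # First pass: process the whole prompt once and keep the key/value cache.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Each later step feeds only the newest token plus the cache, so per-step
    # cost grows with context length but nothing already computed is redone.
    for _ in range(10):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

# Restarting generation from a fresh prompt discards `past`, so the whole
# (prompt + generated) sequence would have to be re-processed from scratch.
```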
In addition to what's been discussed, there might be an unexpected bottleneck related to VRAM usage for context. I contributed a PR that could potentially offer around a 10% performance improvement and a slight drop in VRAM usage. It might be worth taking a look to see if it helps you.
Just for anyone coming here to check on this, I believe this issue has now been fixed by @MasterTaffer's large input optimisation in #87. Large contexts process significantly faster, reducing the time to first generated token.
It seems like #87 fixed it, works like a charm now.
Still having this issue with the latest CUDA build of this repo, as well as two other repos I tried (occam and oobabooga). With small/no context I get 8-10 tokens/second; with large context it reports 1-2 tokens/second, but that is due to a long ~20-second initial delay. The actual generation speed after it starts is just as fast regardless of context size.