Tesla P40 only using 70W under load
So my P40 is only using about 70W while generating responses, and it's not limited in any way (i.e., power delivery or temperature).
P40 isn't very well supported (yet). ExLlama relies heavily on FP16 math, and the P40 just has terrible FP16 performance.
I'm not sure what to do about it, because adding a whole second code path for FP32 would be a lot of work with a lot of maintenance afterwards. Inference would also use considerably more VRAM, so I'm not sure you'd be able to run e.g. a 33B model at full context length despite the 24 GB of VRAM.
For now I just don't have enough time to make it a priority. I have to focus on modern (consumer) GPUs.
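If you want to see the gap for yourself, a quick standalone PyTorch matmul benchmark (a minimal sketch, not part of ExLlama) makes the FP16 vs FP32 difference on Pascal very visible:

```python
# Compare FP16 and FP32 matmul throughput on the current CUDA device.
import time
import torch

def bench(dtype, n=4096, iters=10):
    a = torch.randn(n, n, dtype=dtype, device="cuda")
    b = torch.randn(n, n, dtype=dtype, device="cuda")
    a @ b                          # warm-up
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.time() - t0
    # 2 * n^3 floating-point ops per matmul
    print(f"{dtype}: {2 * n**3 * iters / elapsed / 1e12:.2f} TFLOPS")

bench(torch.float32)   # a P40 (GP102) should land near its ~12 TFLOPS FP32 peak
bench(torch.float16)   # FP16 on GP102 runs at a small fraction of the FP32 rate
```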
Is there anything I can do to improve performance on the P40? Also, is there any way to lower system RAM usage?
Same here. I'm seeing 20+ tok/s on a 13B model with gptq-for-llama/autogptq and 3-4 toks/s with exllama on my P40.
There is a flag for gptq/torch called use_cuda_fp16 = False that gives a massive speed boost -- is it possible to do something similar in exllama?
Also with gptq, I can load 33B models using only 20GB VRAM (with fp16 = False).
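For anyone wanting to reproduce that setup, this is roughly how the flag is passed when loading through AutoGPTQ. A sketch only: the model path is a placeholder, and it assumes a recent AutoGPTQ build where from_quantized() accepts use_cuda_fp16.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "/models/guanaco-33b-gptq"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,
    use_cuda_fp16=False,   # keep the matmuls out of the slow FP16 path on the P40
)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```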
> There is a flag for gptq/torch called use_cuda_fp16 = False that gives a massive speed boost -- is it possible to do something similar in exllama?
Well, it would give a massive boost on the P40 because of its really poor FP16 support, but it would slow things down a lot on newer GPUs. More importantly, it would require a lot of extra work, basically a whole new code path that would essentially just be for the P40. Although I'd like to have good P40 support, right now I'd rather invite someone else to contribute the code for it.
> Also with gptq, I can load 33B models using only 20GB VRAM (with fp16 = False).
Can you use the full context as well?
Don't all Pascal cards, older cards, and maybe Turing cards have bad FP16 performance?
P40 weirdness seems to be even stranger than just "it's slow". I wanted to chart VRAM usage for different models at different prompt context sizes, and the results were... impossible?
Guanaco 33b

| Context | VRAM (GB) | T/S (P40) | Coherent? |
|---------|-----------|-----------|-----------|
| Idle    | 16.6      | N/A       | N/A       |
| 1k      | 19.9      | 0.34      | No        |
| 2k      | OOM       | N/A       | N/A       |
Tulu 30b

| Context | VRAM (GB) | T/S (P40) | Coherent? |
|---------|-----------|-----------|-----------|
| Idle    | 16.7      | N/A       | N/A       |
| 1k      | 23.1      | 0.38      | Yes       |
| 2k      | 23.2      | 0.22      | Yes(ish)  |
| 3k      | 23.4      | 0.15      | Yes       |
| 4k      | 23.5      | 0.13      | Yes       |
| 6k      | OOM       | N/A       | N/A       |
Nous-Hermes 13b

| Context | VRAM (GB) | T/S (P40) | Coherent? |
|---------|-----------|-----------|-----------|
| Idle    | 7.5       | N/A       | N/A       |
| 1k      | 14.0      | 0.96      | Yes       |
| 2k      | 14.2      | 0.54      | Yes       |
| 3k      | 14.3      | 0.38      | Yes       |
| 4k      | 14.5      | 0.21      | Yes       |
| 6k      | 14.7      | 0.02      | No        |
| 7.5k    | 14.9      | 0.05      | No        |
Manticore-chat 13b

| Context | VRAM (GB) | T/S (P40) | Coherent? |
|---------|-----------|-----------|-----------|
| Idle    | 7.6       | N/A       | N/A       |
| 1k      | 14.0      | 0.96      | Yes       |
| 2k      | 14.3      | 0.36      | Yes       |
| 6k      | 14.7      | 0         | No        |
Either exllama has somehow made the context/memory requirement curve sub-linear, or it's really, really broken on Pascal.
I'm curious how you're configuring the model in this case? If you're running with max_seq_len = 8192 in all cases, then the model is correctly allocating the full cache in advance.
And is this on the latest commit? If so then the VRAM usage should top out at the 2k sequence length unless you're overriding the default max_input_len and max_attention_size.
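For reference, these are the settings in question when loading a model directly through ExLlama's Python API rather than a frontend. A minimal sketch, assuming the standalone repo's module layout and with placeholder paths:

```python
# Assumes exllama's model.py / tokenizer.py / generator.py are on the import path.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/models/guanaco-33b-gptq"                   # placeholder path

config = ExLlamaConfig(f"{model_dir}/config.json")
config.model_path = f"{model_dir}/model.safetensors"     # placeholder filename

# The KV cache is allocated up front for this many tokens, so lowering it
# directly lowers the fixed VRAM cost.
config.max_seq_len = 2048

# Chunk size for prompt processing; smaller values cap the size of the
# temporary attention buffers.
config.max_input_len = 2048
config.max_attention_size = 2048 ** 2

model = ExLlama(config)
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(f"{model_dir}/tokenizer.model")
generator = ExLlamaGenerator(model, tokenizer, cache)
```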
Ah, my apologies. I had no idea it was allocating memory for the max context, regardless of how much context was actually being fed in. In retrospect, that perfectly explains what I'm seeing, but since AutoGPTQ and llama.cpp claim VRAM as needed, it didn't even occur to me that that was what you were doing.
Yep. It's necessary to avoid memory fragmentation, but it also makes more sense to me to allocate up front what you can predict you're eventually going to need anyway. But as to that, if this is on the latest commit you still shouldn't be seeing the slight increase in VRAM usage after 2k tokens. And I'm concerned about it going incoherent after about 5k tokens. That was supposed to be fixed.
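As a rough illustration of what that up-front allocation costs, here is the back-of-the-envelope size of the FP16 key/value cache for a LLaMA-33B-class model (60 layers, hidden size 6656) at a 2048-token max_seq_len; the exact figure will differ per model:

```python
# K and V tensors, one of each per layer, FP16 (2 bytes per element).
layers, hidden, seq_len, bytes_fp16 = 60, 6656, 2048, 2
kv_cache = 2 * layers * seq_len * hidden * bytes_fp16
print(f"{kv_cache / 1024**3:.2f} GiB")   # ~3.05 GiB reserved before any generation
```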
I'm actually doing this in oobabooga, not exllama proper. My ooba install is up to date, but I have no clue if their implementation is up to date with your repo. That likely explains the VRAM increases. So I'm probably wasting your time on that front as well.
I'm happy to test it directly on exllama if you want.
Nah, it's fine. It explains it well enough, since it looks like they're a little behind with their packaging of ExLlama as a library.
(The first spike is it behaving normally.)
My P40 gets stuck every other message and then SillyTavern times it out. I get about 3.7 tokens a second on average, but it hangs every other message.
@TimyIsCool As mentioned above by @turboderp, FP16 performance on the P40 means ExLlama is going to be slow.
Try autogptq/gptq-for-llama loaders instead.
It's not that, though; it does work, as shown by that spike. I sometimes get a run of 4 or 5 within 5 seconds, then it just hangs and the others just don't work on the P40.