Tesla P40 only using 70W under load
So my P40 is only using about 70W while generating responses, and it's not limited in any way (i.e., power delivery or temperature).
P40 isn't very well supported (yet). ExLlama relies heavily on FP16 math, and the P40 just has terrible FP16 performance.
I'm not sure what to do about it, because adding a whole second code path for FP32 would be a lot of work with a lot of maintenance afterwards. Inference would also use considerably more VRAM, so I'm not sure you'd be able to run e.g. a 33B model at full context length despite the 24 GB of VRAM.
For now I just don't have enough time to make it a priority. I have to focus on modern (consumer) GPUs.
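If you want to see the gap for yourself, a quick standalone PyTorch matmul benchmark (a minimal sketch, not part of ExLlama) makes the FP16 vs FP32 difference on Pascal very visible:

```python
# Compare FP16 and FP32 matmul throughput on the current CUDA device.
import time
import torch

def bench(dtype, n=4096, iters=10):
    a = torch.randn(n, n, dtype=dtype, device="cuda")
    b = torch.randn(n, n, dtype=dtype, device="cuda")
    a @ b                          # warm-up
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.time() - t0
    # 2 * n^3 floating-point ops per matmul
    print(f"{dtype}: {2 * n**3 * iters / elapsed / 1e12:.2f} TFLOPS")

bench(torch.float32)   # a P40 (GP102) should land near its ~12 TFLOPS FP32 peak
bench(torch.float16)   # FP16 on GP102 runs at a small fraction of the FP32 rate
```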
Is there anything I can do to improve performance on the P40? Also, is there any way to lower system RAM usage?
Same here. I'm seeing 20+ tok/s on a 13B model with gptq-for-llama/autogptq and 3-4 toks/s with exllama on my P40.
There is a flag for gptq/torch called use_cuda_fp16 = False that gives a massive speed boost -- is it possible to do something similar in exllama?
Also with gptq, I can load 33B models using only 20GB VRAM (with fp16 = False).
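For anyone wanting to reproduce that setup, this is roughly how the flag is passed when loading through AutoGPTQ. A sketch only: the model path is a placeholder, and it assumes a recent AutoGPTQ build where from_quantized() accepts use_cuda_fp16.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "/models/guanaco-33b-gptq"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,
    use_cuda_fp16=False,   # keep the matmuls out of the slow FP16 path on the P40
)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```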
> There is a flag for gptq/torch called use_cuda_fp16 = False that gives a massive speed boost -- is it possible to do something similar in exllama?
Well, it would give a massive boost on the P40 because of its really poor FP16 support, but it would slow things down a lot on newer GPUs. More importantly, it would require a lot of extra work, basically a whole new code path that would essentially just be for the P40. Although I'd like to have good P40 support, right now I'd rather invite someone else to contribute the code for it.
> Also with gptq, I can load 33B models using only 20GB VRAM (with fp16 = False).
Can you use the full context as well?
Don't all Pascal cards, older cards, and maybe Turing cards have bad FP16 performance?
P40 weirdness seems to be even stranger than just "it's slow". I wanted to chart VRAM usage for different models at different prompt context sizes, and the results were... impossible?
Guanaco 33b

| Context | VRAM (GB) | T/S (P40) | Coherent? |
|---------|-----------|-----------|-----------|
| Idle    | 16.6      | N/A       | N/A       |
| 1k      | 19.9      | 0.34      | No        |
| 2k      | OOM       | N/A       | N/A       |
Tulu 30b

| Context | VRAM (GB) | T/S (P40) | Coherent? |
|---------|-----------|-----------|-----------|
| Idle    | 16.7      | N/A       | N/A       |
| 1k      | 23.1      | 0.38      | Yes       |
| 2k      | 23.2      | 0.22      | Yes(ish)  |
| 3k      | 23.4      | 0.15      | Yes       |
| 4k      | 23.5      | 0.13      | Yes       |
| 6k      | OOM       | N/A       | N/A       |
Nous-Hermes 13b

| Context | VRAM (GB) | T/S (P40) | Coherent? |
|---------|-----------|-----------|-----------|
| Idle    | 7.5       | N/A       | N/A       |
| 1k      | 14.0      | 0.96      | Yes       |
| 2k      | 14.2      | 0.54      | Yes       |
| 3k      | 14.3      | 0.38      | Yes       |
| 4k      | 14.5      | 0.21      | Yes       |
| 6k      | 14.7      | 0.02      | No        |
| 7.5k    | 14.9      | 0.05      | No        |
Manticore-chat 13b

| Context | VRAM (GB) | T/S (P40) | Coherent? |
|---------|-----------|-----------|-----------|
| Idle    | 7.6       | N/A       | N/A       |
| 1k      | 14.0      | 0.96      | Yes       |
| 2k      | 14.3      | 0.36      | Yes       |
| 6k      | 14.7      | 0         | No        |
Either exllama has somehow made the context/memory requirement curve sub-linear, or it's really, really broken on Pascal.
I'm curious how you're configuring the model in this case? If you're running with max_seq_len = 8192 in all cases, then the model is correctly allocating the full cache in advance.
And is this on the latest commit? If so then the VRAM usage should top out at the 2k sequence length unless you're overriding the default max_input_len and max_attention_size.
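For reference, these are the settings in question when loading a model directly through ExLlama's Python API rather than a frontend. A minimal sketch, assuming the standalone repo's module layout and with placeholder paths:

```python
# Assumes exllama's model.py / tokenizer.py / generator.py are on the import path.
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/models/guanaco-33b-gptq"                   # placeholder path

config = ExLlamaConfig(f"{model_dir}/config.json")
config.model_path = f"{model_dir}/model.safetensors"     # placeholder filename

# The KV cache is allocated up front for this many tokens, so lowering it
# directly lowers the fixed VRAM cost.
config.max_seq_len = 2048

# Chunk size for prompt processing; smaller values cap the size of the
# temporary attention buffers.
config.max_input_len = 2048
config.max_attention_size = 2048 ** 2

model = ExLlama(config)
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(f"{model_dir}/tokenizer.model")
generator = ExLlamaGenerator(model, tokenizer, cache)
```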
Ah, my apologies. I had no idea it was allocating memory for the max context, regardless of how much context was actually being fed in. In retrospect, that perfectly explains what I'm seeing, but since AutoGPTQ and llama.cpp claim VRAM as needed, it didn't even occur to me that that was what you were doing.
Yep. It's necessary to avoid memory fragmentation, but it also makes more sense to me to allocate up front what you can predict you're eventually going to need anyway. But as to that, if this is on the latest commit you still shouldn't be seeing the slight increase in VRAM usage after 2k tokens. And I'm concerned about it going incoherent after about 5k tokens. That was supposed to be fixed.
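As a rough illustration of what that up-front allocation costs, here is the back-of-the-envelope size of the FP16 key/value cache for a LLaMA-33B-class model (60 layers, hidden size 6656) at a 2048-token max_seq_len; the exact figure will differ per model:

```python
# K and V tensors, one of each per layer, FP16 (2 bytes per element).
layers, hidden, seq_len, bytes_fp16 = 60, 6656, 2048, 2
kv_cache = 2 * layers * seq_len * hidden * bytes_fp16
print(f"{kv_cache / 1024**3:.2f} GiB")   # ~3.05 GiB reserved before any generation
```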
I'm actually doing this in oobabooga, not exllama proper. My ooba install is up to date, but I have no clue if their implementation is up to date with your repo. That likely explains the VRAM increases. So I'm probably wasting your time on that front as well.
I'm happy to test it directly on exllama if you want.
Nah, it's fine. It explains it well enough, since it looks like they're a little behind with their packaging of ExLlama as a library.
(The first spike is it behaving normally.)
My P40 gets stuck every other message and then SillyTavern times it out. I get about 3.7 tokens a second on average, but it hangs every other message.
@TimyIsCool As mentioned above by @turboderp, FP16 performance on the P40 means ExLlama is going to be slow.
Try autogptq/gptq-for-llama loaders instead.
It's not that, though; it does work, as shown by that spike. I sometimes get a run of 4 or 5 within 5 seconds, then it just hangs and the others just don't work on the P40.