
Speed on A100

Open • Ber666 opened this issue on Aug 30, 2023 • 4 comments

Hi, thanks for the cool project. I am testing Llama-2-70B-GPTQ on 1x A100 40G, and the speed is around 9 t/s. Is this the expected speed? I noticed in some other issues that the code is only optimized for consumer GPUs, but I just wanted to double-check whether that's the expected speed or I made a mistake somewhere.

Ber666 • Aug 30 '23 07:08

I haven't tested 70B on A100 before, but the speed is close to what I've seen for 65B on A100, so I think this is about expected, yes.

turboderp • Aug 30 '23 08:08

To give you another data point, with 70B I get 10 - 13 t/s per A100 80 GB (SXM4).

jday96314 • Sep 01 '23 03:09

I can't believe that the A100 gets the same speed as the 3090. Maybe something can be improved here?

akaikite • Sep 11 '23 02:09

There's definitely some room for improvement, but you're not going to see anything on the order of the difference in cost between the A100 and the 3090. When you're memory-bound, as you end up being here, what matters is that the A100 40G only has about 50-60% more global memory bandwidth than the 3090. So if the implementation is properly optimized and tuned for that architecture (ExLlama isn't, to be clear) then you're looking at 50-60% more tokens per second.
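For intuition, here's a back-of-envelope sketch (not ExLlama code; the weight size and bandwidth figures are approximate spec-sheet values) of why the bandwidth ratio caps the achievable speedup in the memory-bound, single-stream case:

```python
# Back-of-envelope estimate: in single-stream, memory-bound decoding, every
# generated token has to stream the full set of quantized weights from global
# memory, so peak bandwidth puts a hard ceiling on tokens/second.

def tokens_per_second_ceiling(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed: one full pass over the weights per token."""
    return bandwidth_gb_s / weights_gb

# Rough numbers; real kernels reach only a fraction of the ceiling.
WEIGHTS_GB = 36.0      # ~70B params at 4-bit GPTQ, plus scales/zeros (assumed)
RTX_3090_BW = 936.0    # GB/s, spec-sheet value
A100_40G_BW = 1555.0   # GB/s, spec-sheet value

for name, bw in [("RTX 3090", RTX_3090_BW), ("A100 40G", A100_40G_BW)]:
    print(f"{name}: ceiling ~= {tokens_per_second_ceiling(WEIGHTS_GB, bw):.0f} t/s")

# The ratio of the two ceilings (~1.66x here) is the most the A100's extra
# bandwidth can buy; kernels tuned for consumer GPUs will capture less of it.
```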

Now, if you're serving large batches, inference becomes compute-bound instead, and the A100 will outperform the 3090 very easily. But to serve large batches you also need a bunch more VRAM dedicated to state and cache. 40 GB won't get you very far, and even 80 GB is questionable. What use-case are you optimizing for, then? One quantized 70B model serving no more than 8 concurrent users, or something? A small business willing to invest in one A100 but not two, or three? Or if you're also trying to accommodate multi-A100 setups with tensor parallelism and whatnot, at what point does quantization stop making sense?
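To make the VRAM point concrete, here's a rough FP16 KV-cache sizing sketch for Llama-2-70B (80 layers, 8 grouped-query KV heads, head dim 128); the user count and context length are illustrative assumptions, not ExLlama defaults:

```python
# Rough FP16 KV-cache budget for batched serving (illustrative, not ExLlama code).
LAYERS = 80        # Llama-2-70B
KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_FP16 = 2

# Assumed serving scenario (hypothetical numbers):
USERS = 8
CONTEXT = 4096

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # K and V
cache_gb = USERS * CONTEXT * bytes_per_token / 1e9

print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")
print(f"Cache for {USERS} users x {CONTEXT} tokens: {cache_gb:.1f} GB")
# ~320 KiB/token -> ~10.7 GB of cache on top of ~36 GB of 4-bit weights,
# which is why 40 GB is already tight for even a modest batch.
```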

But yes, V2 is coming, and it's faster all around, including on the A100. So there's that.

turboderp • Sep 11 '23 11:09