Benchmarks vs vLLM?
https://github.com/vllm-project/vllm was just released publicly, claiming to be an inference library that accelerates HF Transformers by 24x.
Just looking over the code, it seems to use many of the same tricks as ExLlama. The CUDA kernels look very similar in places, but that's to be expected, since there are some obvious places where it's just silly not to fuse operations together. Like, the gated activation really doesn't need to be two separate kernels, so hey.
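To illustrate the kind of fusion I mean, here's a minimal sketch (function names are mine, not taken from either codebase): in eager PyTorch the gated activation is two separate elementwise kernel launches, but it fuses trivially into one.

```python
import torch
import torch.nn.functional as F

def gated_activation(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    # Eager PyTorch launches two elementwise kernels here: SiLU, then the multiply,
    # so the SiLU output makes a pointless round trip through global memory.
    return F.silu(gate) * up

# torch.compile will fuse the two ops into a single kernel; a hand-written CUDA
# kernel for the hot path accomplishes the same thing.
fused_gated_activation = torch.compile(gated_activation)
```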
They've also identified memory fragmentation around the K/V cache as a huge problem with the way HF transformer models work, and they dedicate most of the blog post to talking about their solution. I think their paging system sounds very interesting, but they might be hyping it up a little too much. Like here:
> Dynamic: Its size depends on the sequence length, which is highly variable and unpredictable. As a result, efficiently managing the KV cache presents a significant challenge.
I totally disagree with that. Well, not totally: they're right that the simple concatenation in Transformers is a really bad solution, but you can also just allocate the memory up front and never release it. And it's not unpredictable at all: you need to reserve space for the full sequence length you intend to serve, otherwise you're just setting yourself up for OOM errors at an unpredictable time later. Incidentally, that also makes the solution really simple. But there are potentially other benefits to the paging system that I'm keen to read more about.
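To illustrate what I mean by allocating up front, here's a rough sketch of a statically allocated cache (shapes and names are illustrative, not ExLlama's actual cache code):

```python
import torch

class StaticKVCache:
    """Reserves K/V storage for the full target sequence length up front,
    so memory use is fixed for the lifetime of the cache and there is no
    reallocation, concatenation or fragmentation during generation."""

    def __init__(self, num_layers, batch_size, num_heads, max_seq_len, head_dim,
                 device="cuda", dtype=torch.float16):
        shape = (num_layers, 2, batch_size, num_heads, max_seq_len, head_dim)
        self.cache = torch.zeros(shape, device=device, dtype=dtype)
        self.current_len = 0

    def store(self, layer, k, v):
        # k, v: (batch_size, num_heads, new_tokens, head_dim) for this layer
        n = k.shape[2]
        self.cache[layer, 0, :, :, self.current_len:self.current_len + n] = k
        self.cache[layer, 1, :, :, self.current_len:self.current_len + n] = v

    def advance(self, new_tokens):
        # Called once per forward pass, after all layers have stored their K/V.
        self.current_len += new_tokens

    def keys(self, layer):
        return self.cache[layer, 0, :, :, :self.current_len]

    def values(self, layer):
        return self.cache[layer, 1, :, :, :self.current_len]
```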
I suspect they're leaving some performance on the table by relying on Torch for linear layers, since in my experience Torch isn't very good at matrix-vector multiplications. It matters less with batches, obviously.
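The reason: during single-token generation every linear layer degenerates into a matrix-vector product, where the entire weight matrix is read to produce one output row, so throughput is pure memory bandwidth. With a batch it becomes a proper GEMM again and cuBLAS does fine. Rough illustration (sizes approximate Llama-7B's MLP):

```python
import torch

hidden, intermediate = 4096, 11008  # roughly Llama-7B MLP dimensions
w = torch.randn(hidden, intermediate, device="cuda", dtype=torch.float16)

x_decode = torch.randn(1, hidden, device="cuda", dtype=torch.float16)   # one token
x_batch  = torch.randn(64, hidden, device="cuda", dtype=torch.float16)  # 64 sequences

y_decode = x_decode @ w  # GEMV: all of w is read to produce a single row (bandwidth-bound)
y_batch  = x_batch @ w   # GEMM: the same weight read is amortized over 64 rows
```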
Overall I suspect their implementation is quite efficient, but I'm very confused by their benchmarks. They're measuring in requests per minute but I can't see them specify anywhere what a "request" is. Surely the speed will depend on the number of tokens generated...?
Anyway, it's never going to be a fair comparison between vLLM and ExLlama because they're not using quantized models and ExLlama uses only quantized models. I'll see if maybe I can't get a 7B model to load, though, and compare it anyway.
Okay, so I did a quick test and I'm getting about 53 tokens/second for Llama-7B with vLLM.
That's actually not bad at all considering they're running in FP16. I get about 170 t/s, but that's for a 4-bit quantized model using only about 30% of the VRAM (and therefore reading about 30% as many weight bytes per token). So it makes sense that vLLM would get about 30% of the speed, if both implementations are bumping up against the bandwidth limit of the 4090.
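Napkin math behind that (my own rough numbers, not measurements): at the bandwidth limit, every generated token costs roughly one full read of the weights.

```python
bandwidth_gb_s = 1008   # RTX 4090 theoretical memory bandwidth, ~1 TB/s
weights_fp16_gb = 14    # Llama-7B at 16 bits per weight
weights_4bit_gb = 4     # same model at ~4 bits per weight, plus overhead

print(bandwidth_gb_s / weights_fp16_gb)  # ~72 tokens/s ceiling for FP16
print(bandwidth_gb_s / weights_4bit_gb)  # ~250 tokens/s ceiling for 4-bit
```

53 and 170 t/s are both roughly two-thirds to three-quarters of those ceilings, which is consistent with both implementations being memory-bound rather than compute-bound.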
Interestingly, vLLM seems unaffected by context length, while I see upwards of a 20% difference between short and long contexts with ExLlama. This could be because they're more heavily bottlenecked in the linear layers, so attention has less of an impact, but it could also suggest their attention kernel is faster than what I'm using (which is just PyTorch SDP for long sequences). So I'll have to dig into that.
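For reference, the long-sequence path I mean is essentially just a single new query attending over the cached keys and values (shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

batch, heads, head_dim, ctx_len = 1, 32, 128, 2048
device, dtype = "cuda", torch.float16

q       = torch.randn(batch, heads, 1, head_dim, device=device, dtype=dtype)        # new token
k_cache = torch.randn(batch, heads, ctx_len, head_dim, device=device, dtype=dtype)  # cached keys
v_cache = torch.randn(batch, heads, ctx_len, head_dim, device=device, dtype=dtype)  # cached values

# One query position attending over the whole cache: the cost grows linearly with
# ctx_len, and no causal mask is needed since the new token may attend to everything.
out = F.scaled_dot_product_attention(q, k_cache, v_cache)
```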
xformers has a very good memory-efficient attention implementation. Not sure if you are aware.
It does, but Torch already uses it by default in `scaled_dot_product_attention`.
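You can also pin the backend if you want to compare them in isolation; something like this (a sketch, as of Torch 2.0's `torch.backends.cuda.sdp_kernel` context manager):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 32, 1, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)

# Restrict SDPA to the memory-efficient (xformers-style) backend so it can be
# timed separately from the flash and math backends.
with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                    enable_math=False,
                                    enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v)
```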
PyTorch in general seems to be optimized for training and for inference on long sequences. Python itself becomes a real issue when kernel launches stop queuing up, because the kernels finish faster than the Python interpreter can issue the next ones. And the conventional wisdom about attention having quadratic complexity simply isn't true in the case that's relevant here, where we're generating one token at a time: per token the cost is linear in the context length, and attention ends up being bandwidth-limited like everything else. It also takes a considerable context length before attention starts to slow things down noticeably, since every other part of the inference is O(1) with respect to context length.
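To put rough numbers on that (my own napkin math, Llama-7B-ish dimensions in FP16): the linear layers read a fixed number of bytes per token, while attention reads the K/V cache, which grows linearly with context.

```python
num_layers, hidden, bytes_per_el = 32, 4096, 2

weights_bytes = 7e9 * bytes_per_el                               # ~14 GB read per token
kv_bytes_per_ctx_token = 2 * num_layers * hidden * bytes_per_el  # K + V across all layers

for ctx in (512, 2048, 8192):
    kv_read = ctx * kv_bytes_per_ctx_token
    print(ctx, f"{kv_read / weights_bytes:.1%} of the weight traffic")
# ~2% at 512 tokens of context, ~8% at 2048, ~31% at 8192: the attention reads
# only become a significant fraction of the total at fairly long contexts.
```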