turboderp

Results: 180 comments by turboderp

Hmm, going over the context length, even by one token, is definitely an error. It won't really do anything in regular GPTQ-for-LLaMa because the cache will just grow a bit...
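
A minimal sketch of the kind of guard implied here, assuming a fixed pre-allocated cache and hypothetical `max_seq_len` / `max_new_tokens` parameters:

```python
def check_context(input_ids, max_seq_len, max_new_tokens):
    # Hypothetical guard: with a fixed, pre-allocated cache, the prompt plus
    # everything still to be generated has to fit inside max_seq_len, so
    # exceeding it by even one token is treated as an error rather than
    # letting the cache silently grow.
    needed = len(input_ids) + max_new_tokens
    if needed > max_seq_len:
        raise ValueError(
            f"Sequence of {needed} tokens exceeds max_seq_len ({max_seq_len})"
        )
```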

Hmm, you shouldn't be getting corrupted output, but maybe it's because the model config disagrees with the cache config. In which case that shouldn't really be an option for the...
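
A rough sketch of the mismatch being described, with hypothetical config objects and field names:

```python
def validate_cache_config(model_config, cache_config):
    # Hypothetical sanity check: corrupted output is a typical symptom of a
    # cache allocated with different dimensions (heads, head size, max length)
    # than the model actually writes into it.
    for field in ("num_heads", "head_dim", "max_seq_len"):
        if getattr(model_config, field) != getattr(cache_config, field):
            raise ValueError(f"Model/cache config mismatch on '{field}'")
```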

Just looking over the code it seems to use many of the same tricks as ExLlama. The CUDA kernels look very similar in places, but that's to be expected since...

Okay, so I did a quick test and I'm getting about 53 tokens/second for Llama-7B with vLLM. That's actually not bad at all considering they're running in FP16. I get...
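
For reference, a tokens-per-second figure like that is just wall-clock time over generated tokens; a minimal sketch with a placeholder `generate` callable (not any specific vLLM or ExLlama API):

```python
import time

def tokens_per_second(generate, prompt_ids, num_tokens):
    # `generate` stands in for whatever generation call the backend exposes;
    # the measurement itself is just elapsed wall-clock time.
    start = time.time()
    generate(prompt_ids, max_new_tokens=num_tokens)
    return num_tokens / (time.time() - start)
```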

It does, but Torch already uses it by default in `scaled_dot_product_attention`.
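
A minimal sketch of that default dispatch, assuming a CUDA device and FP16 inputs that qualify for the fused backends:

```python
import torch
import torch.nn.functional as F

# scaled_dot_product_attention selects a fused backend (flash or
# memory-efficient attention) automatically when dtype, device and masking
# allow it, so nothing extra has to be wired in.
q = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```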

PyTorch in general seems to be optimized for training and inference on long sequences. Python itself becomes a real issue when the kernel launches don't queue up because they execute...
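
A small illustration of that launch-overhead effect, assuming a CUDA device; with tiny kernels the per-step cost is mostly Python and launch overhead rather than GPU work:

```python
import time
import torch

def avg_step_time(fn, iters=1000):
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

small = torch.randn(64, 64, device="cuda", dtype=torch.float16)
large = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# The small matmul finishes faster than Python can launch the next one, so
# its per-step time mostly reflects interpreter and launch overhead; the
# large matmul is dominated by actual GPU work.
print("small:", avg_step_time(lambda: small @ small))
print("large:", avg_step_time(lambda: large @ large))
```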

I commented on the reddit thread as well, and it does impact perplexity quite a bit. 1:1 - ppl = 6.31, 1:2 - ppl = 7.42, 1:4 - ppl =...
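
For context, perplexity here is just the exponential of the mean token negative log-likelihood; a minimal sketch:

```python
import math
import torch.nn.functional as F

def perplexity(logits, target_ids):
    # logits: (num_tokens, vocab_size), target_ids: (num_tokens,)
    nll = F.cross_entropy(logits, target_ids)
    return math.exp(nll.item())
```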

P40 isn't very well supported (yet). ExLlama relies heavily on FP16 math, and the P40 just has terrible FP16 performance. I'm not sure what to do about it, because adding...

Yep. I only tested how scaling affects the first part of the context, though. I didn't test perplexity over the full expanded context.

@Jeduh I'm still running tests at the moment, but dynamically scaling the embeddings, I *think*, would work poorly. The tuned model has some tolerance, but ideally you would tune it...
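
A minimal sketch of the fixed-factor position scaling being discussed (linear RoPE interpolation), with the caveat that the factor should match what the model was tuned on rather than change per request:

```python
import torch

def rope_sin_cos(seq_len, head_dim, scale=1.0, base=10000.0):
    # Standard RoPE angle tables; scale > 1 compresses positions, e.g.
    # scale=2.0 maps position 4096 back onto a model's trained 2048 range.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / scale
    angles = torch.outer(positions, inv_freq)
    return angles.sin(), angles.cos()

# The factor is meant to stay fixed and matched to the fine-tune; picking it
# dynamically from the current sequence length is what I expect to work poorly.
sin, cos = rope_sin_cos(seq_len=4096, head_dim=128, scale=2.0)
```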