turboderp

Results: 180 comments by turboderp

Hmm, going over the context length, even by one token, is definitely an error. It won't really do anything in regular GPTQ-for-LLaMa because the cache will just grow a bit...
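
A minimal sketch of the kind of guard implied here, assuming a fixed pre-allocated cache and hypothetical `max_seq_len` / `max_new_tokens` parameters:

```python
def check_context(input_ids, max_seq_len, max_new_tokens):
    # Hypothetical guard: with a fixed, pre-allocated cache, the prompt plus
    # everything still to be generated has to fit inside max_seq_len, so
    # exceeding it by even one token is treated as an error rather than
    # letting the cache silently grow.
    needed = len(input_ids) + max_new_tokens
    if needed > max_seq_len:
        raise ValueError(
            f"Sequence of {needed} tokens exceeds max_seq_len ({max_seq_len})"
        )
```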

Hmm, you shouldn't be getting corrupted output, but maybe it's because the model config disagrees with the cache config. In which case that shouldn't really be an option for the...
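
A rough sketch of the mismatch being described, with hypothetical config objects and field names:

```python
def validate_cache_config(model_config, cache_config):
    # Hypothetical sanity check: corrupted output is a typical symptom of a
    # cache allocated with different dimensions (heads, head size, max length)
    # than the model actually writes into it.
    for field in ("num_heads", "head_dim", "max_seq_len"):
        if getattr(model_config, field) != getattr(cache_config, field):
            raise ValueError(f"Model/cache config mismatch on '{field}'")
```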

Just looking over the code it seems to use many of the same tricks as ExLlama. The CUDA kernels look very similar in places, but that's to be expected since...

Okay, so I did a quick test and I'm getting about 53 tokens/second for Llama-7B with vLLM. That's actually not bad at all considering they're running in FP16. I get...
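
For reference, a tokens-per-second figure like that is just wall-clock time over generated tokens; a minimal sketch with a placeholder `generate` callable (not any specific vLLM or ExLlama API):

```python
import time

def tokens_per_second(generate, prompt_ids, num_tokens):
    # `generate` stands in for whatever generation call the backend exposes;
    # the measurement itself is just elapsed wall-clock time.
    start = time.time()
    generate(prompt_ids, max_new_tokens=num_tokens)
    return num_tokens / (time.time() - start)
```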

It does, but Torch already uses it by default in `scaled_dot_product_attention`.
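
A minimal sketch of that default dispatch, assuming a CUDA device and FP16 inputs that qualify for the fused backends:

```python
import torch
import torch.nn.functional as F

# scaled_dot_product_attention selects a fused backend (flash or
# memory-efficient attention) automatically when dtype, device and masking
# allow it, so nothing extra has to be wired in.
q = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```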

PyTorch in general seems to be optimized for training and inference on long sequences. Python itself becomes a real issue when the kernel launches don't queue up because they execute...
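
A small illustration of that launch-overhead effect, assuming a CUDA device; with tiny kernels the per-step cost is mostly Python and launch overhead rather than GPU work:

```python
import time
import torch

def avg_step_time(fn, iters=1000):
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

small = torch.randn(64, 64, device="cuda", dtype=torch.float16)
large = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# The small matmul finishes faster than Python can launch the next one, so
# its per-step time mostly reflects interpreter and launch overhead; the
# large matmul is dominated by actual GPU work.
print("small:", avg_step_time(lambda: small @ small))
print("large:", avg_step_time(lambda: large @ large))
```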

I commented on the reddit thread as well, and it does impact perplexity quite a bit. 1:1 - ppl = 6.31, 1:2 - ppl = 7.42, 1:4 - ppl =...
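
For context, perplexity here is just the exponential of the mean token negative log-likelihood; a minimal sketch:

```python
import math
import torch.nn.functional as F

def perplexity(logits, target_ids):
    # logits: (num_tokens, vocab_size), target_ids: (num_tokens,)
    nll = F.cross_entropy(logits, target_ids)
    return math.exp(nll.item())
```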

P40 isn't very well supported (yet). ExLlama relies heavily on FP16 math, and the P40 just has terrible FP16 performance. I'm not sure what to do about it, because adding...

Yep. I only tested how scaling affects the first part of the context, though. I didn't test perplexity over the full expanded context.

@Jeduh I'm still running tests at the moment, but dynamically scaling the embeddings, I *think*, would work poorly. The tuned model has some tolerance, but ideally you would tune it...
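
A minimal sketch of the fixed-factor position scaling being discussed (linear RoPE interpolation), with the caveat that the factor should match what the model was tuned on rather than change per request:

```python
import torch

def rope_sin_cos(seq_len, head_dim, scale=1.0, base=10000.0):
    # Standard RoPE angle tables; scale > 1 compresses positions, e.g.
    # scale=2.0 maps position 4096 back onto a model's trained 2048 range.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / scale
    angles = torch.outer(positions, inv_freq)
    return angles.sin(), angles.cos()

# The factor is meant to stay fixed and matched to the fine-tune; picking it
# dynamically from the current sequence length is what I expect to work poorly.
sin, cos = rope_sin_cos(seq_len=4096, head_dim=128, scale=2.0)
```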