turboderp
It seems that 3b uses a head dimension of 100, which is a strange departure from 128 for all the other models. Some of the CUDA kernels assumed it would...
Perplexity will vary depending on the dataset. 7.53 looks reasonable for wikitext2, though. It's somewhat worse than what's normal for a 7b model, but that's what you'd expect from 3b....
Okay, I found another bug, specifically affecting 3b act-order models. With the latest commit I get ppl = 7.86, and I'm going to write off the difference as this model...
I'm quite happy that the 3b model works, anyway. I'm not surprised that it's very limited compared to 7b, but it could still be useful as a draft model for...
There is some room for optimization, yes, but it's difficult to keep tweaking ExLlama as long as every minor change has the potential to break something people have started to...
The size of the cache is: `2 * max_seq_len * num_hidden_layers * hidden_size * sizeof(float16)`. For a 2048-token context that works out to:

- 7b: 1,024 MB
- 13b: 1,600...
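As a quick sanity check, here's a minimal sketch of that formula in Python. The layer counts and hidden sizes are the standard LLaMA config values, and the helper name is just for illustration, not anything from ExLlama itself:

```python
def kv_cache_bytes(max_seq_len, num_hidden_layers, hidden_size, bytes_per_elem=2):
    # 2 = one K tensor plus one V tensor per layer; float16 = 2 bytes per element
    return 2 * max_seq_len * num_hidden_layers * hidden_size * bytes_per_elem

for name, layers, hidden in [("7b", 32, 4096), ("13b", 40, 5120), ("33b", 60, 6656)]:
    print(f"{name}: {kv_cache_bytes(2048, layers, hidden) / 1024**2:,.0f} MB")
# 7b: 1,024 MB   13b: 1,600 MB   33b: 3,120 MB
```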
You should use the cpe value that's appropriate for the model, in any case. The 8k SuperHOT models are tuned for a factor of 4.0, regardless of how much of...
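For context, the compression factor is just linear position interpolation: the position indices are divided by the factor before the rotary embeddings are computed, so a factor of 4.0 maps 8192 positions into the model's native 2048-position range. A rough sketch of the idea (not ExLlama's actual kernel):

```python
import torch

def rope_angles(positions, head_dim, compress_pos_emb=1.0, base=10000.0):
    # Dividing the position index by the compression factor squeezes a longer
    # context into the range the model was trained on (e.g. 8192 / 4.0 -> 0..2048).
    positions = positions.float() / compress_pos_emb
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions, inv_freq)   # angles fed into the sin/cos tables

angles = rope_angles(torch.arange(8192), head_dim=128, compress_pos_emb=4.0)
```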
The implementation needs some temporary buffers on each device, yes. Not sure exactly how much it works out to, but it might be around 1.5 GB for 33b.
The K/V cache is split between GPUs. It's really a collection of caches, one for each layer of the model, so if you put 20 layers on one GPU and 40...
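As a rough illustration, assuming the 33b config from above (60 layers, hidden size 6656), a 20/40 layer split would put about 1,040 MB of cache on the first GPU and 2,080 MB on the second:

```python
def cache_mb_per_gpu(layer_split, max_seq_len=2048, hidden_size=6656):
    # Each layer owns its own K and V tensors, so a GPU's share of the cache
    # is simply the per-layer size times the number of layers placed on it.
    per_layer = 2 * max_seq_len * hidden_size * 2   # K + V in float16
    return [layers * per_layer / 1024**2 for layers in layer_split]

print(cache_mb_per_gpu([20, 40]))   # [1040.0, 2080.0]
```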
Do you get this with other models as well? It sounds like it's related to Chromium, AMD drivers and Wayland..?