turboderp
It seems that 3b uses a head dimension of 100, which is a strange departure from 128 for all the other models. Some of the CUDA kernels assumed it would...
Perplexity will vary depending on the dataset. 7.53 looks reasonable for wikitext2, though. It's somewhat worse than what's normal for a 7b model, but that's what you'd expect from 3b....
Okay, I found another bug, specifically affecting 3b act-order models. With the latest commit I get ppl = 7.86, and I'm going to write off the difference as this model...
I'm quite happy that the 3b model works, anyway. I'm not surprised that it's very limited compared to 7b, but it could still be useful as a draft model for...
There is some room for optimization, yes, but it's difficult to keep tweaking ExLlama as long as every minor change has the potential to break something people have started to...
The size of the cache is: `2 * max_seq_len * num_hidden_layers * hidden_size * sizeof(float16)`. For a 2048-token context that works out to:

- 7b: 1,024 MB
- 13b: 1,600...
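As a quick sanity check, here's a minimal sketch of that formula in Python. The layer counts and hidden sizes are the standard LLaMA config values, and the helper name is just for illustration, not anything from ExLlama itself:

```python
def kv_cache_bytes(max_seq_len, num_hidden_layers, hidden_size, bytes_per_elem=2):
    # 2 = one K tensor plus one V tensor per layer; float16 = 2 bytes per element
    return 2 * max_seq_len * num_hidden_layers * hidden_size * bytes_per_elem

for name, layers, hidden in [("7b", 32, 4096), ("13b", 40, 5120), ("33b", 60, 6656)]:
    print(f"{name}: {kv_cache_bytes(2048, layers, hidden) / 1024**2:,.0f} MB")
# 7b: 1,024 MB   13b: 1,600 MB   33b: 3,120 MB
```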
You should use the cpe value that's appropriate for the model, in any case. The 8k SuperHOT models are tuned for a factor of 4.0, regardless of how much of...
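For context, the compression factor is just linear position interpolation: the position indices are divided by the factor before the rotary embeddings are computed, so a factor of 4.0 maps 8192 positions into the model's native 2048-position range. A rough sketch of the idea (not ExLlama's actual kernel):

```python
import torch

def rope_angles(positions, head_dim, compress_pos_emb=1.0, base=10000.0):
    # Dividing the position index by the compression factor squeezes a longer
    # context into the range the model was trained on (e.g. 8192 / 4.0 -> 0..2048).
    positions = positions.float() / compress_pos_emb
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions, inv_freq)   # angles fed into the sin/cos tables

angles = rope_angles(torch.arange(8192), head_dim=128, compress_pos_emb=4.0)
```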
The implementation needs some temporary buffers on each device, yes. Not sure exactly how much it works out to, but it might be around 1.5 GB for 33b.
The K/V cache is split between GPUs. It's really a collection of caches, one for each layer of the model, so if you put 20 layers on one GPU and 40...
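As a rough illustration, assuming the 33b config from above (60 layers, hidden size 6656), a 20/40 layer split would put about 1,040 MB of cache on the first GPU and 2,080 MB on the second:

```python
def cache_mb_per_gpu(layer_split, max_seq_len=2048, hidden_size=6656):
    # Each layer owns its own K and V tensors, so a GPU's share of the cache
    # is simply the per-layer size times the number of layers placed on it.
    per_layer = 2 * max_seq_len * hidden_size * 2   # K + V in float16
    return [layers * per_layer / 1024**2 for layers in layer_split]

print(cache_mb_per_gpu([20, 40]))   # [1040.0, 2080.0]
```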
Do you get this with other models as well? It sounds like it's related to Chromium, AMD drivers and Wayland..?