Johannes Gäßler
That check is the old FlashAttention check; I forgot to change it.
If it's only token generation that is faster, then this PR is pretty much pointless because the FlashAttention kernel for batch size 1 does not use tensor cores at all...
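For illustration only (this is not the actual llama.cpp kernel; the name and layout are made up): with batch size 1 the attention for the single new token reduces to dot products against the KV cache plus a softmax and a weighted sum, i.e. plain scalar multiply-adds with no matrix-matrix tiles that tensor core (MMA) instructions could be fed with. A minimal sketch:

```cpp
// Minimal single-query attention sketch (batch size 1 / token generation).
// Everything is a dot product or a weighted sum -> plain FMAs, nothing
// shaped like the matrix-matrix tiles that tensor cores operate on.
// Launch example: attn_single_query<<<1, 256, n*sizeof(float)>>>(q, K, V, out, n, d, 1.0f/sqrtf((float) d));
__global__ void attn_single_query(
        const float * q,   // [d]    query of the one new token
        const float * K,   // [n, d] keys in the KV cache
        const float * V,   // [n, d] values in the KV cache
        float       * out, // [d]    attention output
        const int n, const int d, const float scale) {
    extern __shared__ float s[]; // n attention scores

    // 1. scores: one dot product per KV cache entry
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        float dot = 0.0f;
        for (int j = 0; j < d; ++j) {
            dot += q[j]*K[i*d + j]; // scalar FMA
        }
        s[i] = dot*scale;
    }
    __syncthreads();

    // 2. softmax over the scores (single-threaded for clarity, not speed)
    if (threadIdx.x == 0) {
        float max_val = s[0];
        for (int i = 1; i < n; ++i) max_val = fmaxf(max_val, s[i]);
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) { s[i] = expf(s[i] - max_val); sum += s[i]; }
        for (int i = 0; i < n; ++i) s[i] /= sum;
    }
    __syncthreads();

    // 3. output: weighted sum of the value rows, again just multiply-adds
    for (int j = threadIdx.x; j < d; j += blockDim.x) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i) {
            acc += s[i]*V[i*d + j]; // scalar FMA
        }
        out[j] = acc;
    }
}
```

The batched prompt-processing case, by contrast, multiplies whole blocks of queries against K and V, which is where tensor cores (and libraries built around them) can actually be put to use.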
My current stance is that a 2% speedup is not large enough to justify adding a dependency, especially when there is no dev with the ability...
Obsolete now that https://github.com/ggerganov/llama.cpp/pull/5021 has been merged.
>I was hoping to find a way to avoid this. The main reason is that the flash attention kernels can be extended to support quantized data (for quantum KV cache)...
Obsolete now that https://github.com/ggerganov/llama.cpp/pull/5021 has been merged.
I can't reproduce the issue on a GTX 1070 + GTX 1050 Ti. What's the command you are running that results in the error?
By default llama.cpp does not use half-precision floating point arithmetic; it uses 32-bit floats. I recently bought a P40 and plan to optimize performance for it,...
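To make the FP32-vs-FP16 point concrete, here is a hypothetical stand-alone micro-benchmark (not llama.cpp code; the kernel names and the file name `fma_bench.cu` are made up) that roughly compares FP32 and packed FP16 multiply-add throughput, which is what you would want to check before relying on half-precision arithmetic on a card like the P40. Compile with something like `nvcc -arch=sm_61 fma_bench.cu`:

```cpp
// Rough FP32 vs. FP16 multiply-add throughput comparison (hypothetical sketch).
#include <cstdio>
#include <cuda_fp16.h>

__global__ void fma_f32(float * out, const int iters) {
    float a = 0.9999f, b = 1.0001f, c = (float) threadIdx.x;
    for (int i = 0; i < iters; ++i) c = fmaf(a, c, b); // FP32 FMA chain
    out[blockIdx.x*blockDim.x + threadIdx.x] = c;
}

__global__ void fma_f16(half * out, const int iters) {
    half2 a = __floats2half2_rn(0.9999f, 0.9999f);
    half2 b = __floats2half2_rn(1.0001f, 1.0001f);
    half2 c = __floats2half2_rn((float) threadIdx.x, 0.0f);
    for (int i = 0; i < iters; ++i) c = __hfma2(a, c, b); // packed FP16 FMA chain
    out[blockIdx.x*blockDim.x + threadIdx.x] = __low2half(c);
}

int main() {
    const int blocks = 256, threads = 256, iters = 1 << 18;
    float * buf32; half * buf16;
    cudaMalloc(&buf32, blocks*threads*sizeof(float));
    cudaMalloc(&buf16, blocks*threads*sizeof(half));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    float ms;

    fma_f32<<<blocks, threads>>>(buf32, iters); // warm-up
    cudaEventRecord(t0);
    fma_f32<<<blocks, threads>>>(buf32, iters);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("FP32 FMA: %8.2f ms\n", ms);

    fma_f16<<<blocks, threads>>>(buf16, iters); // warm-up
    cudaEventRecord(t0);
    fma_f16<<<blocks, threads>>>(buf16, iters);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("FP16 FMA: %8.2f ms\n", ms); // half2 does 2 FMAs per instruction

    cudaFree(buf32); cudaFree(buf16);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return 0;
}
```

On a P40 the FP16 variant should come out far slower than FP32, since Pascal cards other than the P100 only have a token amount of FP16 throughput; that is exactly why the FP32 default matters for that card.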
Move it to the idea list. My interest in GregTech is currently low and there are other things that would be of higher priority.