Johannes Gäßler
That check is the old FlashAttention check; I forgot to change it.
If it's only token generation that is faster, then this PR is pretty much pointless because the FlashAttention kernel for batch size 1 does not use tensor cores at all...
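For illustration only (this is not the actual llama.cpp kernel; the name and layout are made up): with batch size 1 the attention for the single new token reduces to dot products against the KV cache plus a softmax and a weighted sum, i.e. plain scalar multiply-adds with no matrix-matrix tiles that tensor core (MMA) instructions could be fed with. A minimal sketch:

```cpp
// Minimal single-query attention sketch (batch size 1 / token generation).
// Everything is a dot product or a weighted sum -> plain FMAs, nothing
// shaped like the matrix-matrix tiles that tensor cores operate on.
// Launch example: attn_single_query<<<1, 256, n*sizeof(float)>>>(q, K, V, out, n, d, 1.0f/sqrtf((float) d));
__global__ void attn_single_query(
        const float * q,   // [d]    query of the one new token
        const float * K,   // [n, d] keys in the KV cache
        const float * V,   // [n, d] values in the KV cache
        float       * out, // [d]    attention output
        const int n, const int d, const float scale) {
    extern __shared__ float s[]; // n attention scores

    // 1. scores: one dot product per KV cache entry
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        float dot = 0.0f;
        for (int j = 0; j < d; ++j) {
            dot += q[j]*K[i*d + j]; // scalar FMA
        }
        s[i] = dot*scale;
    }
    __syncthreads();

    // 2. softmax over the scores (single-threaded for clarity, not speed)
    if (threadIdx.x == 0) {
        float max_val = s[0];
        for (int i = 1; i < n; ++i) max_val = fmaxf(max_val, s[i]);
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) { s[i] = expf(s[i] - max_val); sum += s[i]; }
        for (int i = 0; i < n; ++i) s[i] /= sum;
    }
    __syncthreads();

    // 3. output: weighted sum of the value rows, again just multiply-adds
    for (int j = threadIdx.x; j < d; j += blockDim.x) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i) {
            acc += s[i]*V[i*d + j]; // scalar FMA
        }
        out[j] = acc;
    }
}
```

The batched prompt-processing case, by contrast, multiplies whole blocks of queries against K and V, which is where tensor cores (and libraries built around them) can actually be put to use.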
My current stance is that a 2% speedup is not large enough to justify adding a dependency, especially when there is no dev with the ability...
Obsolete now that https://github.com/ggerganov/llama.cpp/pull/5021 has been merged.
>I was hoping to find a way to avoid this. The main reason is that the flash attention kernels can be extended to support quantized data (for quantum KV cache)...
Obsolete now that https://github.com/ggerganov/llama.cpp/pull/5021 has been merged.
I can't reproduce the issue on a GTX 1070 + GTX 1050 Ti. What's the command you are running that results in the error?
By default llama.cpp does not use half-precision floating point arithmetic; it uses 32-bit floats. I recently bought a P40 and plan to optimize performance for it,...
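To make the FP32-vs-FP16 point concrete, here is a hypothetical stand-alone micro-benchmark (not llama.cpp code; the kernel names and the file name `fma_bench.cu` are made up) that roughly compares FP32 and packed FP16 multiply-add throughput, which is what you would want to check before relying on half-precision arithmetic on a card like the P40. Compile with something like `nvcc -arch=sm_61 fma_bench.cu`:

```cpp
// Rough FP32 vs. FP16 multiply-add throughput comparison (hypothetical sketch).
#include <cstdio>
#include <cuda_fp16.h>

__global__ void fma_f32(float * out, const int iters) {
    float a = 0.9999f, b = 1.0001f, c = (float) threadIdx.x;
    for (int i = 0; i < iters; ++i) c = fmaf(a, c, b); // FP32 FMA chain
    out[blockIdx.x*blockDim.x + threadIdx.x] = c;
}

__global__ void fma_f16(half * out, const int iters) {
    half2 a = __floats2half2_rn(0.9999f, 0.9999f);
    half2 b = __floats2half2_rn(1.0001f, 1.0001f);
    half2 c = __floats2half2_rn((float) threadIdx.x, 0.0f);
    for (int i = 0; i < iters; ++i) c = __hfma2(a, c, b); // packed FP16 FMA chain
    out[blockIdx.x*blockDim.x + threadIdx.x] = __low2half(c);
}

int main() {
    const int blocks = 256, threads = 256, iters = 1 << 18;
    float * buf32; half * buf16;
    cudaMalloc(&buf32, blocks*threads*sizeof(float));
    cudaMalloc(&buf16, blocks*threads*sizeof(half));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    float ms;

    fma_f32<<<blocks, threads>>>(buf32, iters); // warm-up
    cudaEventRecord(t0);
    fma_f32<<<blocks, threads>>>(buf32, iters);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("FP32 FMA: %8.2f ms\n", ms);

    fma_f16<<<blocks, threads>>>(buf16, iters); // warm-up
    cudaEventRecord(t0);
    fma_f16<<<blocks, threads>>>(buf16, iters);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("FP16 FMA: %8.2f ms\n", ms); // half2 does 2 FMAs per instruction

    cudaFree(buf32); cudaFree(buf16);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return 0;
}
```

On a P40 the FP16 variant should come out far slower than FP32, since Pascal cards other than the P100 only have a token amount of FP16 throughput; that is exactly why the FP32 default matters for that card.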
Move it to the idea list. My interest in GregTech is currently low and there are other things that would be of higher priority.