Johannes Gäßler comments

Results: 235 comments of Johannes Gäßler

Is it normal that ROCm+HIPBLAS produces different results than on CPU or breaks completely?

It's already close to 2 AM where I live, but I think the 160K tokens you use as input may simply not be enough. I'll do some related testing tomorrow.

Is it normal that ROCm+HIPBLAS produces different results than on CPU or breaks completely?

Using Wikitext-103 train as input and the models that I already have available I am so far not able to provoke `imatrix` into producing NaN values. `imatrix` calculates sums of...

Is it normal that ROCm+HIPBLAS produces different results than on CPU or breaks completely?

>so how would a sum of something + nans turn out into something not a nan, something you claimed would happen? Nobody has seen this happen, and numerically it cannot...
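
For reference on the quoted point, here is a minimal sketch of how a NaN behaves in a running floating-point sum under IEEE 754 arithmetic. It is not taken from the thread or from the `imatrix` code; the helper name `running_sum` and the input values are made up for illustration.

```python
import math


def running_sum(values):
    """Accumulate values the naive way, one addition at a time."""
    total = 0.0
    for v in values:
        total += v  # any NaN operand makes the result NaN from here on
    return total


clean = running_sum([1.0, 2.5, 3.0])
poisoned = running_sum([1.0, float("nan"), 3.0])

print(clean)                 # finite sum of the three inputs
print(math.isnan(poisoned))  # the NaN survives the later additions
```

Under these assumptions, a single NaN anywhere in the inputs leaves the accumulated sum as NaN, since later additions cannot turn it back into a finite value.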

New optimization from NVIDIA to use CUDA Graphs in llama.cpp

Regarding ggml graph creation overhead: I think the impact of this will heavily depend on the baseline t/s you can get with a given model. Presumably you're investigating the impact...

Investigate PagedAttention KV-cache memory management for faster inference

llama.cpp currently only ever serves one user at a time so this optimization is not applicable.

Investigate PagedAttention KV-cache memory management for faster inference

Yes, for enterprise use, where you have one server generating responses for many users in parallel, the optimization would be useful.

Investigate PagedAttention KV-cache memory management for faster inference

I don't have any plans for it because I don't care about commercial use, but I can't speak for the other devs.

Investigate PagedAttention KV-cache memory management for faster inference

I'm not really concerned with what other people want to use llama.cpp for. I'm implementing things that are useful for me personally first and foremost. And I don't see how...

Fix flash-attn for AMD

Sorry, I forgot: you need to install rocWMMA (https://github.com/ROCm/rocWMMA).

Fix flash-attn for AMD

Well, this looks like it would be non-trivial to fix. I was hoping it would be possible to just use rocWMMA as a drop-in replacement. But as I said, I...