Johannes Gäßler
I forgot about this PR, sorry.
I looked into the issue and quite frankly I don't think it's worth the effort to fix. Currently the CUDA code runs everything as f32 by default and it would...
>I'm getting a different text output than on an NVIDIA card. Is it ok?

There is a binary called `perplexity` which - as the name implies - can be used...
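For reference, this is roughly how the `perplexity` binary is invoked (the model and dataset paths are placeholders); a correct backend should produce a perplexity value very close to the reference for the same model file:

```sh
# Lower is better; large deviations between backends on the same model
# file point at a numerical problem rather than ordinary nondeterminism.
./perplexity -m models/7B/ggml-model-q4_0.gguf -f wikitext-2-raw/wiki.test.raw
```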
>Wrt. performance. If compute capability is not enough information then ZLUDA could add a CUDA extension to surface whatever llama.cpp needs with the simplest bit being the underlying HIP device...
>tile sizes are fixed for a given architecture; llama.cpp compiles several variants for whatever architectures were chosen at compile time, and then during runtime the llama.cpp code chooses the appropriate kernel...
That should work for the CUDA code (and probably work better than the current code does). The question is what to do for HIP. There does seem to be an equivalent `hipFuncGetAttributes`...
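To illustrate what I have in mind, here is a rough sketch of selecting between precompiled kernel variants at runtime via `cudaFuncGetAttributes`; the kernel names and the selection criterion are made up for illustration and are not the actual llama.cpp dispatch logic:

```cpp
#include <cuda_runtime.h>

// Two hypothetical precompiled variants of the same kernel with different
// tile sizes; the real llama.cpp kernels look nothing like this.
__global__ void mul_mat_tile_large(const float * a, const float * b, float * c) { /* ... */ }
__global__ void mul_mat_tile_small(const float * a, const float * b, float * c) { /* ... */ }

// Prefer the large-tile variant, but fall back to the small one if the
// device cannot launch it with the block size we need. On HIP the
// analogous query would presumably be hipFuncGetAttributes.
static const void * pick_mul_mat_kernel(int threads_per_block) {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, (const void *) mul_mat_tile_large) == cudaSuccess &&
        attr.maxThreadsPerBlock >= threads_per_block) {
        return (const void *) mul_mat_tile_large;
    }
    return (const void *) mul_mat_tile_small;
}
```

The advantage over hard-coding tile sizes per compute capability is that the decision is based on what the device actually reports, which is exactly the information a translation layer like ZLUDA could surface truthfully.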
I created a PR with some changes for q4_0: https://github.com/ggerganov/llama.cpp/pull/5554. Is this how you imagined it?
I recently worked with these files and should be able to review. However, I'm currently attending a scientific conference and will only be available next week.
I read the paper and I do not understand how their proposed sampling method can be better than what they call "naive sampling". Fundamentally, if the probability distribution of the...
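For context, a minimal sketch of what I understand "naive sampling" to mean, i.e. drawing the next token with probability proportional to the model's output distribution (the function name is mine):

```cpp
#include <random>
#include <vector>

// Draw a token index i with probability weights[i] / sum(weights).
// std::discrete_distribution normalizes the weights internally, so the
// model's output probabilities can be passed in as-is.
int naive_sample(const std::vector<float> & weights, std::mt19937 & rng) {
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}
```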