Johannes Gäßler
I forgot about this PR, sorry.
I looked into the issue and quite frankly I don't think it's worth the effort to fix. Currently the CUDA code runs everything as f32 by default and it would...
>I'm getting a different text output than on an NVIDIA card. Is it ok?

There is a binary called `perplexity` which - as the name implies - can be used...
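For reference, this is roughly how the `perplexity` binary is invoked (the model and dataset paths are placeholders); a correct backend should produce a perplexity value very close to the reference for the same model file:

```sh
# Lower is better; large deviations between backends on the same model
# file point at a numerical problem rather than ordinary nondeterminism.
./perplexity -m models/7B/ggml-model-q4_0.gguf -f wikitext-2-raw/wiki.test.raw
```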
>Wrt. performance. If compute capability is not enough information then ZLUDA could add a CUDA extension to surface whatever llama.cpp needs with the simplest bit being the underlying HIP device...
>tile sizes are fixed for a given architecture; llama.cpp compiles several variants for whatever architectures were chosen at compile time, and then during runtime the llama.cpp code chooses the appropriate kernel...
That should work for the CUDA code (and probably work better than the current code does). The question is what to do for HIP. There does seem to be an equivalent `hipFuncGetAttributes`...
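To illustrate what I have in mind, here is a rough sketch of selecting between precompiled kernel variants at runtime via `cudaFuncGetAttributes`; the kernel names and the selection criterion are made up for illustration and are not the actual llama.cpp dispatch logic:

```cpp
#include <cuda_runtime.h>

// Two hypothetical precompiled variants of the same kernel with different
// tile sizes; the real llama.cpp kernels look nothing like this.
__global__ void mul_mat_tile_large(const float * a, const float * b, float * c) { /* ... */ }
__global__ void mul_mat_tile_small(const float * a, const float * b, float * c) { /* ... */ }

// Prefer the large-tile variant, but fall back to the small one if the
// device cannot launch it with the block size we need. On HIP the
// analogous query would presumably be hipFuncGetAttributes.
static const void * pick_mul_mat_kernel(int threads_per_block) {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, (const void *) mul_mat_tile_large) == cudaSuccess &&
        attr.maxThreadsPerBlock >= threads_per_block) {
        return (const void *) mul_mat_tile_large;
    }
    return (const void *) mul_mat_tile_small;
}
```

The advantage over hard-coding tile sizes per compute capability is that the decision is based on what the device actually reports, which is exactly the information a translation layer like ZLUDA could surface truthfully.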
I created a PR with some changes for q4_0: https://github.com/ggerganov/llama.cpp/pull/5554. Is this how you imagined it?
I recently worked with these files and should be able to review. However, I'm currently attending a scientific conference and will only be available next week.
I read the paper and I do not understand how their proposed sampling method can be better than what they call "naive sampling". Fundamentally, if the probability distribution of the...
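For context, a minimal sketch of what I understand "naive sampling" to mean, i.e. drawing the next token with probability proportional to the model's output distribution (the function name is mine):

```cpp
#include <random>
#include <vector>

// Draw a token index i with probability weights[i] / sum(weights).
// std::discrete_distribution normalizes the weights internally, so the
// model's output probabilities can be passed in as-is.
int naive_sample(const std::vector<float> & weights, std::mt19937 & rng) {
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}
```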