Johannes Gäßler comments

Results: 235 comments by Johannes Gäßler

Multi GPU support, CUDA refactor, CUDA scratch buffer

I've pushed a rebased version. The new quantization formats seem to be working correctly in combination with multi GPU. The CI will take some time anyways so I will quickly...

Multi GPU support, CUDA refactor, CUDA scratch buffer

As far as I can tell everything is working correctly. Performance is good as long as I don't forget to disable debug options.

Multi GPU support, CUDA refactor, CUDA scratch buffer

I get `Error: connect ECONNREFUSED 127.0.1.1:8080` when I try to run the server example, but I get the same error on master, so I'm assuming that it has nothing to...

Multi GPU support, CUDA refactor, CUDA scratch buffer

Thank you for being patient with me.

Multi GPU support, CUDA refactor, CUDA scratch buffer

I didn't investigate what the minimum compute capability is. The multi GPU code does work on 4x GTX Titan X though, which have a compute capability of 5.2.

Multi GPU support, CUDA refactor, CUDA scratch buffer

I don't see why the generation would be done on the CPU. I think the problem is rather that one of the operations that I used has good performance on...

Multi GPU support, CUDA refactor, CUDA scratch buffer

https://github.com/ggerganov/llama.cpp/tree/master/examples/main#additional-options

Multi GPU support, CUDA refactor, CUDA scratch buffer

Sorry, it seems I made a mistake at some point and didn't catch it during review. This is not intended.

Fine tune MUL_MAT, new threading (spin+wait/notify), speedup q_f32 BLAS by splitting COMPUTE stage

I was thinking recently that better threading would be nice to have. Anyways, I haven't yet looked at the PR in detail, but I can already give you feedback regarding...

[Review] Merge PowerInfer with llama.cpp mainline

I've written about it here: https://github.com/ggerganov/llama.cpp/discussions/4534#discussioncomment-7900305