Johannes Gäßler

Results 235 comments of Johannes Gäßler

I was using a very short prompt for testing. There is an issue with long prompts.

Can you quickly check whether the code produces correct results with `--tensor-split 1,0,0,0`?

I fixed the issue with prompt processing. f16 still seems to have a bug somewhere with multiple GPUs.

I fixed the f16 issues. As a side effect, f16 t/s also went up by ~100%, because until now it was always using the general f16 → f32 matrix multiplication function...

@huichen Can you do another test? The problem may have been caused by me only using one cuBLAS handle instead of one per GPU.

Yes, as I've said a dozen times before: I have seen the exllama repository. But it's not as simple as copy-pasting code from one project to another. I already know...

Don't worry, I'm patient. Currently I'm working on GPU acceleration for the remaining tensors. If I get a working version before this PR gets merged, should I just keep pushing...

You need to set the `--n-gpu-layers` CLI argument to utilize the GPUs.

This is similar to the first version that I implemented in this PR: https://github.com/ggerganov/llama.cpp/pull/1483 . I was requested to change it in such a way that the CUDA code be...

Have you tested the performance of this PR for the case where the model is larger than the available RAM?