Johannes Gäßler

Results 235 comments of Johannes Gäßler

I was using a very short prompt for testing. There is an issue with long prompts.

Can you quickly check whether the code produces correct results with `--tensor-split 1,0,0,0`?

I fixed the issue with prompt processing. f16 still seems to have a bug somewhere with multiple GPUs.

I fixed the f16 issues. As a side effect, f16 t/s also went up by ~100%, because until now it was always using the general f16 → f32 matrix multiplication function...

@huichen Can you do another test? The problem may have been caused by me only using one cuBLAS handle instead of one per GPU.

Yes, as I've said a dozen times before: I have seen the exllama repository. But it's not as simple as copy-pasting code from one project to another. I already know...

Don't worry, I'm patient. Currently I'm working on GPU acceleration for the remaining tensors. If I get a working version before this PR gets merged, should I just keep pushing...

You need to set the `--n-gpu-layers` CLI argument to utilize the GPUs.

This is similar to the first version that I implemented in this PR: https://github.com/ggerganov/llama.cpp/pull/1483 . I was requested to change it in such a way that the CUDA code be...

Have you tested the performance of this PR for the case where the model is larger than the available RAM?