Johannes Gäßler
Are we talking about the same thing? Perhaps I should have clarified: I am talking about the performance for 33b q4_0 since 33b is the use case that I care...
I quickly tried implementing a kernel for q4_0 myself: https://github.com/JohannesGaessler/llama.cpp/commit/390f0a9b17c8a2060a6bbf75c5871dc2f4b42e58 . On my hardware (GTX 1070), 7b perplexity on the first 100 lines of wikitext is 7% faster compared to the...
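For anyone curious what such a kernel boils down to, below is a heavily simplified sketch of a q4_0 dequantize + matrix-vector multiply. The struct mirrors ggml's `block_q4_0` layout (one fp16 scale plus 16 bytes of packed 4-bit quants per 32 weights); the thread mapping and naive reduction are purely illustrative and are **not** the kernel from the linked commit, which is organized differently for performance.

```cpp
// Minimal sketch of a q4_0 dequantize + matrix-vector multiply kernel.
// Struct layout mirrors ggml's block_q4_0; everything else is illustrative.
#include <cuda_fp16.h>
#include <stdint.h>

#define QK4_0 32

typedef struct {
    half    d;              // scale
    uint8_t qs[QK4_0 / 2];  // 4-bit quants, two per byte
} block_q4_0;

// y = A * x, where A has ncols columns per row, stored row-major as q4_0 blocks.
// Launch with one thread block per matrix row and <= 256 threads (power of two).
__global__ void dequantize_mul_mat_vec_q4_0(
        const block_q4_0 * A, const float * x, float * y, int ncols) {
    const int row     = blockIdx.x;
    const int nblocks = ncols / QK4_0;

    float partial = 0.0f;
    for (int ib = threadIdx.x; ib < nblocks; ib += blockDim.x) {
        const block_q4_0 b = A[row * nblocks + ib];
        const float d = __half2float(b.d);
        for (int j = 0; j < QK4_0 / 2; ++j) {
            // low nibble -> weight j, high nibble -> weight j + 16
            const int q0 = (b.qs[j] & 0x0F) - 8;
            const int q1 = (b.qs[j] >>   4) - 8;
            partial += d * q0 * x[ib * QK4_0 + j];
            partial += d * q1 * x[ib * QK4_0 + j + QK4_0 / 2];
        }
    }

    // reduce the per-thread partial sums via shared memory
    __shared__ float sums[256];
    sums[threadIdx.x] = partial;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sums[threadIdx.x] += sums[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[row] = sums[0];
}
```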
Performance numbers from my test machine with an i5-4570S, 16 GB of RAM @ 1600 MHz, and a GTX 1070 + a GTX 1050 Ti: | Model | GPU |...
> You mean 30B? I can run 30B Q4_0 with my 8 GB card with 20 layers loaded only.

"30B" seems to be a typo by Meta that has become dominant....
I added a comment to explain the weird device-to-host memcpy for split tensors. Since I, as the person who wrote the code, won't know: are there other parts...
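To make the intent of that memcpy clearer, here is a rough sketch of the pattern, assuming tensors are split by rows across devices: each GPU computes the output rows for its slice, and those rows then have to be copied back into one contiguous host buffer before the next op can consume them. The names and the helper below are illustrative, not the actual code in the PR.

```cpp
// Illustrative only: gather per-device partial results for a row-split tensor
// back into one host buffer via device-to-host memcpy.
#include <cuda_runtime.h>

// dst_host: full result buffer on the host, nrows_total * ncols floats
// dst_dev:  per-device partial results
// row_low/row_high: the row range assigned to device `id`
void gather_split_result(float * dst_host, float ** dst_dev,
                         const int * row_low, const int * row_high,
                         int ncols, int n_devices) {
    for (int id = 0; id < n_devices; ++id) {
        const int nrows = row_high[id] - row_low[id];
        if (nrows <= 0) continue;
        cudaSetDevice(id);
        // copy this device's rows to their offset in the host buffer
        cudaMemcpy(dst_host + (size_t) row_low[id] * ncols,
                   dst_dev[id],
                   (size_t) nrows * ncols * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
}
```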
I added a CLI argument that lets the user set the tensor split. On my system a less VRAM-efficient split of 3:1 seems to do better than 2:1 because...
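For illustration, here is a minimal sketch of how such a split argument (a comma-separated list of proportions like `3,1`, which is what the `--tensor-split` flag in current llama.cpp accepts) could be turned into per-device row ranges. The actual parsing and splitting logic in the PR may differ.

```cpp
// Illustrative only: map split proportions to per-device row ranges.
#include <cstdio>
#include <utility>
#include <vector>

// Given split proportions and the total number of rows of a tensor,
// compute [row_low, row_high) for each device.
std::vector<std::pair<int, int>> split_rows(const std::vector<float> & split,
                                            int nrows_total) {
    float sum = 0.0f;
    for (float s : split) sum += s;

    std::vector<std::pair<int, int>> ranges;
    int row = 0;
    for (size_t id = 0; id < split.size(); ++id) {
        int nrows = (int) (nrows_total * split[id] / sum);
        if (id == split.size() - 1) nrows = nrows_total - row; // remainder goes to the last device
        ranges.push_back({row, row + nrows});
        row += nrows;
    }
    return ranges;
}

int main() {
    // e.g. a 3:1 split of a 6656-row tensor across two GPUs
    std::vector<float> split = {3.0f, 1.0f};
    for (auto [lo, hi] : split_rows(split, 6656)) {
        printf("rows [%d, %d)\n", lo, hi);
    }
}
```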
It's a server set up by my institute. The CPU is an Intel(R) Xeon(R) E5-2630 v4 @ 2.20 GHz. I don't know what kind of RAM they put in since...
Alright, unless I'm forgetting something, this PR should now be ready to be merged from my end.
There seems to be an issue with f16 models.
I fixed f16. I should perhaps mention that this quantization type does **not** support multiple GPUs; I plan to work on better f16 support in the future (and see if...