Johannes Gäßler
Are we talking about the same thing? Perhaps I should have clarified: I am talking about the performance for 33b q4_0 since 33b is the use case that I care...
I quickly tried implementing a kernel for q4_0 myself: https://github.com/JohannesGaessler/llama.cpp/commit/390f0a9b17c8a2060a6bbf75c5871dc2f4b42e58 . On my hardware (GTX 1070), 7b perplexity on the first 100 lines of wikitext is 7% faster compared to the...
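For anyone curious what such a kernel boils down to, below is a heavily simplified sketch of a q4_0 dequantize + matrix-vector multiply. The struct mirrors ggml's `block_q4_0` layout (one fp16 scale plus 16 bytes of packed 4-bit quants per 32 weights); the thread mapping and naive reduction are purely illustrative and are **not** the kernel from the linked commit, which is organized differently for performance.

```cpp
// Minimal sketch of a q4_0 dequantize + matrix-vector multiply kernel.
// Struct layout mirrors ggml's block_q4_0; everything else is illustrative.
#include <cuda_fp16.h>
#include <stdint.h>

#define QK4_0 32

typedef struct {
    half    d;              // scale
    uint8_t qs[QK4_0 / 2];  // 4-bit quants, two per byte
} block_q4_0;

// y = A * x, where A has ncols columns per row, stored row-major as q4_0 blocks.
// Launch with one thread block per matrix row and <= 256 threads (power of two).
__global__ void dequantize_mul_mat_vec_q4_0(
        const block_q4_0 * A, const float * x, float * y, int ncols) {
    const int row     = blockIdx.x;
    const int nblocks = ncols / QK4_0;

    float partial = 0.0f;
    for (int ib = threadIdx.x; ib < nblocks; ib += blockDim.x) {
        const block_q4_0 b = A[row * nblocks + ib];
        const float d = __half2float(b.d);
        for (int j = 0; j < QK4_0 / 2; ++j) {
            // low nibble -> weight j, high nibble -> weight j + 16
            const int q0 = (b.qs[j] & 0x0F) - 8;
            const int q1 = (b.qs[j] >>   4) - 8;
            partial += d * q0 * x[ib * QK4_0 + j];
            partial += d * q1 * x[ib * QK4_0 + j + QK4_0 / 2];
        }
    }

    // reduce the per-thread partial sums via shared memory
    __shared__ float sums[256];
    sums[threadIdx.x] = partial;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sums[threadIdx.x] += sums[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[row] = sums[0];
}
```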
Performance numbers from my test machine with an i5-4570S, 16 GB of RAM @ 1600 MHz, and a GTX 1070 + a GTX 1050 Ti: | Model | GPU |...
> You mean 30B? I can run 30B Q4_0 with my 8 GB card with 20 layers loaded only.

"30B" seems to be a typo by Meta that has become dominant....
I added a comment to explain the weird device-to-host memcpy for split tensors. Since I, as the person who wrote the code, won't know: are there other parts...
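To make the intent of that memcpy clearer, here is a rough sketch of the pattern, assuming tensors are split by rows across devices: each GPU computes the output rows for its slice, and those rows then have to be copied back into one contiguous host buffer before the next op can consume them. The names and the helper below are illustrative, not the actual code in the PR.

```cpp
// Illustrative only: gather per-device partial results for a row-split tensor
// back into one host buffer via device-to-host memcpy.
#include <cuda_runtime.h>

// dst_host: full result buffer on the host, nrows_total * ncols floats
// dst_dev:  per-device partial results
// row_low/row_high: the row range assigned to device `id`
void gather_split_result(float * dst_host, float ** dst_dev,
                         const int * row_low, const int * row_high,
                         int ncols, int n_devices) {
    for (int id = 0; id < n_devices; ++id) {
        const int nrows = row_high[id] - row_low[id];
        if (nrows <= 0) continue;
        cudaSetDevice(id);
        // copy this device's rows to their offset in the host buffer
        cudaMemcpy(dst_host + (size_t) row_low[id] * ncols,
                   dst_dev[id],
                   (size_t) nrows * ncols * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }
}
```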
I added a CLI argument that lets the user set the tensor split. On my system a less VRAM-efficient split of 3:1 seems to do better than 2:1 because...
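For illustration, here is a minimal sketch of how such a split argument (a comma-separated list of proportions like `3,1`, which is what the `--tensor-split` flag in current llama.cpp accepts) could be turned into per-device row ranges. The actual parsing and splitting logic in the PR may differ.

```cpp
// Illustrative only: map split proportions to per-device row ranges.
#include <cstdio>
#include <utility>
#include <vector>

// Given split proportions and the total number of rows of a tensor,
// compute [row_low, row_high) for each device.
std::vector<std::pair<int, int>> split_rows(const std::vector<float> & split,
                                            int nrows_total) {
    float sum = 0.0f;
    for (float s : split) sum += s;

    std::vector<std::pair<int, int>> ranges;
    int row = 0;
    for (size_t id = 0; id < split.size(); ++id) {
        int nrows = (int) (nrows_total * split[id] / sum);
        if (id == split.size() - 1) nrows = nrows_total - row; // remainder goes to the last device
        ranges.push_back({row, row + nrows});
        row += nrows;
    }
    return ranges;
}

int main() {
    // e.g. a 3:1 split of a 6656-row tensor across two GPUs
    std::vector<float> split = {3.0f, 1.0f};
    for (auto [lo, hi] : split_rows(split, 6656)) {
        printf("rows [%d, %d)\n", lo, hi);
    }
}
```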
It's a server set up by my institute. The CPU is an Intel(R) Xeon(R) E5-2630 v4 @ 2.20 GHz. I don't know what kind of RAM they put in since...
Alright, unless I'm forgetting something, this PR should now be ready to be merged from my end.
There seems to be an issue with f16 models.
I fixed f16. I should perhaps mention that this quantization type does **not** support multiple GPUs; I plan to work on better f16 support in the future (and see if...