LostRuins Concedo
Just wondering, for all those who have tried: how much speedup do you get in the batched **prompt eval timings** vs OpenBLAS (not perplexity calculations)? Would be good to benchmark...
> I would bring up CLBlast as it's been implemented over at https://github.com/LostRuins/koboldcpp/ and isn't Nvidia-exclusive, but from my experience, speed-ups are minor or it just ends up being slower...
@slaren @0cc4m we've solved the issue - apparently there was code in llama.cpp that made the graph switch to single-threaded mode during BLAS calculations - understandable for...
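For context, a minimal sketch of the kind of check being described. The helper name `ggml_compute_forward_mul_mat_use_blas` matches the ggml of that era, but the surrounding structure here is an approximation, not the exact upstream code:

```cpp
// Sketch of the pattern in ggml's graph compute: when a matmul node will be
// handled by BLAS, the node is planned with a single task, since the BLAS
// library threads internally and extra ggml threads would just spin.
for (int i = 0; i < cgraph->n_nodes; i++) {
    struct ggml_tensor * node = cgraph->nodes[i];

    int n_tasks = n_threads;
    if (node->op == GGML_OP_MUL_MAT &&
        ggml_compute_forward_mul_mat_use_blas(node->src0, node->src1, node)) {
        n_tasks = 1; // the BLAS call parallelizes itself
    }
    node->n_tasks = n_tasks;
}
```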
In case anyone is concerned - 0cc4m is the main developer of the code relating to the CLBlast kernels and implementation, and we are fine with this code being merged...
@philpax No, there are no issues determining ftype in the file for me so far. The modulo for ftype is only required for ggml magic files (`0x67676d6c`), not ggjt (`0x67676a74`),...
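To illustrate the demultiplexing in question - a hedged sketch, where the factor of 1000 matches `GGML_QNT_VERSION_FACTOR` in upstream ggml, but the helper itself is hypothetical:

```cpp
#include <cstdint>

// Magic values as they appear in the thread.
constexpr uint32_t MAGIC_GGML = 0x67676d6c; // "ggml" (unversioned header)
constexpr uint32_t MAGIC_GGJT = 0x67676a74; // "ggjt" (versioned header)

// Matches GGML_QNT_VERSION_FACTOR in upstream ggml.
constexpr uint32_t QNT_VERSION_FACTOR = 1000;

// Hypothetical helper: recover the real ftype from the stored field.
// Old "ggml" files multiplex the quantization version into the ftype field
// (stored = ftype + qnt_version * 1000), so a modulo is needed; "ggjt"
// files carry a separate file version and store ftype as-is.
uint32_t resolve_ftype(uint32_t magic, uint32_t stored_ftype) {
    if (magic == MAGIC_GGML) {
        return stored_ftype % QNT_VERSION_FACTOR;
    }
    return stored_ftype;
}
```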
@philpax I'm personally more in the camp of "if it's not broken, don't fix it" - so given that the version+ftype multiplex was already added to existing ggml and...
But it has to be consistent. Leaving the standard freely extensible but undefined can quickly lead to fractured formats, as each implementation adds its own keys. That's how you get...
Also, this really isn't a llama.cpp issue unless it's a tokenizer problem. You can confirm whether the input tokens match the vocab - e.g. by dumping them as sketched below.
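A hedged sketch of that check, using the `llama.h` API of the time (`llama_tokenize` / `llama_token_to_str`; treat the exact signatures as assumptions):

```cpp
#include <cstdio>
#include <string>
#include <vector>

#include "llama.h"

// Dump the token ids and their vocab strings for a prompt, so the ids can
// be compared against the model's vocab or another implementation's output.
void dump_tokens(llama_context * ctx, const std::string & text) {
    std::vector<llama_token> tokens(text.size() + 8);
    const int n = llama_tokenize(ctx, text.c_str(), tokens.data(),
                                 (int) tokens.size(), /*add_bos=*/true);
    if (n < 0) {
        fprintf(stderr, "tokenization failed\n");
        return;
    }
    tokens.resize(n);
    for (llama_token id : tokens) {
        printf("%6d -> '%s'\n", id, llama_token_to_str(ctx, id));
    }
}
```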
If the patch is applied as a polyfill, it should be forwards compatible.
Ah, I get that. But I would say that *this repo* has sort of become *the* de facto standard, as all the implementations I know of are based on the code here....