cpumaxx
Is there anything special needed to see performance gains? I cloned/built/tested this PR branch and am seeing no change in performance on CPU (CUDA support flags disabled at compile time)
> For CPU, I think you need something that supports bf16 acceleration, like AVX512VNNI?
> Also, you need a conversion script that just copies the BF16 weights from py to GGUF to...
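For context on the conversion-script point above: BF16 is just the upper 16 bits of an IEEE-754 FP32 value, which is why copying BF16 weights straight into GGUF can be a bit-level copy rather than a numeric requantization. Here's a minimal NumPy sketch of that relationship (not the PR's actual converter; the function names are mine):

```python
import numpy as np

def fp32_to_bf16(x: np.ndarray) -> np.ndarray:
    """Narrow FP32 to BF16 (returned as raw uint16 payloads), round-to-nearest-even.
    NaN/Inf handling is omitted for brevity."""
    bits = x.astype(np.float32).view(np.uint32)
    rounding = np.uint32(0x7FFF) + ((bits >> 16) & 1)  # RNE on the discarded low half
    return ((bits + rounding) >> 16).astype(np.uint16)

def bf16_to_fp32(b: np.ndarray) -> np.ndarray:
    """Widen raw BF16 payloads back to FP32 by zero-filling the low 16 bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

w = np.random.randn(4).astype(np.float32)
print(w)
print(bf16_to_fp32(fp32_to_bf16(w)))  # matches to ~3 significant decimal digits
```

The round-to-nearest-even step only matters when starting from FP32; weights already stored as BF16 can be copied verbatim.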
> I think the full implementation is on the llamafile side.

What should be expected in llama.cpp from this patch specifically? I'm seeing about a 6% speed increase on prompt processing and...
> the next thing I'll do is upstream the llamafile bfloat16 kernels

Nice. I'll keep an eye out for them. Is there a relevant branch on your llama.cpp fork I...
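For anyone else following along, here's roughly what a bfloat16 kernel buys you at the semantic level: the dot products run over BF16 operands widened to FP32 on the fly, with FP32 accumulation, and the speedup comes from doing that widening with SIMD (e.g. AVX-512 BF16) rather than converting element by element in scalar code. A rough NumPy sketch of the reference semantics, not the llamafile code itself:

```python
import numpy as np

def bf16_to_fp32(b: np.ndarray) -> np.ndarray:
    """Widen raw BF16 payloads (uint16) to FP32 by zero-filling the low 16 bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

def bf16_dot(a_bf16: np.ndarray, b_bf16: np.ndarray) -> np.float32:
    """Dot product over BF16 operands with FP32 accumulation."""
    return np.float32(np.dot(bf16_to_fp32(a_bf16), bf16_to_fp32(b_bf16)))

# quick sanity check against an FP32 reference
a32 = np.random.randn(1024).astype(np.float32)
b32 = np.random.randn(1024).astype(np.float32)
a16 = (a32.view(np.uint32) >> 16).astype(np.uint16)  # truncating FP32 -> BF16
b16 = (b32.view(np.uint32) >> 16).astype(np.uint16)
print(bf16_dot(a16, b16), np.dot(a32, b32))
```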
> Here's an example of what you should expect to see with that branch.
>
> ```
> llama_print_timings: load time = 773.90 ms
> llama_print_timings: sample time = 0.46...
> ```
> Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created...
Update: I was suspicious of the large delta between the unified branch and master, so I downloaded the official Mistral 7B weights myself and converted them to an FP16 GGUF manually. Doing so...
I've re-run the tests with the "-t 16 --numa isolate --no-mmap" flags in order to eliminate any confounding memory-locality issues, and there is still the same 0.10 t/s gap with FP16...
Is there anything else needed before this PR can be merged?
A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available. I will try a test on...