cpumaxx
Is there anything special needed to see performance gains? I cloned/built/tested this PR branch and am seeing no change in performance on CPU (CUDA support flags disabled at compile time)
> For CPU, I think you need something that supports bf16 acceleration, like AVX512VNNI?
> Also, you need a conversion script that just copies the BF16 weights from py to GGUF to...
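For context on the conversion-script point above: BF16 is just the upper 16 bits of an IEEE-754 FP32 value, which is why copying BF16 weights straight into GGUF can be a bit-level copy rather than a numeric requantization. Here's a minimal NumPy sketch of that relationship (not the PR's actual converter; the function names are mine):

```python
import numpy as np

def fp32_to_bf16(x: np.ndarray) -> np.ndarray:
    """Narrow FP32 to BF16 (returned as raw uint16 payloads), round-to-nearest-even.
    NaN/Inf handling is omitted for brevity."""
    bits = x.astype(np.float32).view(np.uint32)
    rounding = np.uint32(0x7FFF) + ((bits >> 16) & 1)  # RNE on the discarded low half
    return ((bits + rounding) >> 16).astype(np.uint16)

def bf16_to_fp32(b: np.ndarray) -> np.ndarray:
    """Widen raw BF16 payloads back to FP32 by zero-filling the low 16 bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

w = np.random.randn(4).astype(np.float32)
print(w)
print(bf16_to_fp32(fp32_to_bf16(w)))  # matches to ~3 significant decimal digits
```

The round-to-nearest-even step only matters when starting from FP32; weights already stored as BF16 can be copied verbatim.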
> I think the full implementation is on the llamafile side.

What should be expected in llama.cpp from this patch specifically? I'm seeing about a 6% speed increase on prompt processing and...
> the next thing I'll do is upstream the llamafile bfloat16 kernels

Nice. I'll keep an eye out for them. Is there a relevant branch on your llama.cpp fork I...
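For anyone else following along, here's roughly what a bfloat16 kernel buys you at the semantic level: the dot products run over BF16 operands widened to FP32 on the fly, with FP32 accumulation, and the speedup comes from doing that widening with SIMD (e.g. AVX-512 BF16) rather than converting element by element in scalar code. A rough NumPy sketch of the reference semantics, not the llamafile code itself:

```python
import numpy as np

def bf16_to_fp32(b: np.ndarray) -> np.ndarray:
    """Widen raw BF16 payloads (uint16) to FP32 by zero-filling the low 16 bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

def bf16_dot(a_bf16: np.ndarray, b_bf16: np.ndarray) -> np.float32:
    """Dot product over BF16 operands with FP32 accumulation."""
    return np.float32(np.dot(bf16_to_fp32(a_bf16), bf16_to_fp32(b_bf16)))

# quick sanity check against an FP32 reference
a32 = np.random.randn(1024).astype(np.float32)
b32 = np.random.randn(1024).astype(np.float32)
a16 = (a32.view(np.uint32) >> 16).astype(np.uint16)  # truncating FP32 -> BF16
b16 = (b32.view(np.uint32) >> 16).astype(np.uint16)
print(bf16_dot(a16, b16), np.dot(a32, b32))
```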
> Here's an example of what you should expect to see with that branch.
>
> ```
> llama_print_timings: load time = 773.90 ms
> llama_print_timings: sample time = 0.46...
> ```
> Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created...
Update: I was suspicious of the large delta between the unified branch and master, so I downloaded the official Mistral 7B weights myself and converted them to an FP16 GGUF manually. Doing so...
I've re-run the tests with the "-t 16 --numa isolate --no-mmap" flags in order to eliminate any confounding memory-locality issues, and there is still the same 0.10 t/s gap with FP16...
Is there anything else needed before this PR can be merged?
A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available. I will try a test on...