Max Krasnyansky
> As a heads-up, armv8.7a will not work on older devices, e.g. Pixel 6 Pro (a 3-year-old device, 2021), even though these devices are running recent Android versions...
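For context, here is a minimal sketch (assuming an aarch64 Linux/Android target; this is not the llama.cpp build logic) of probing an optional armv8.6/8.7-level extension at runtime with `getauxval()` instead of compiling the whole binary with `-march=armv8.7-a`, which SIGILLs on older cores:

```cpp
// Illustrative sketch only: runtime detection of the i8mm extension on aarch64.
#include <sys/auxv.h>
#include <cstdio>

#ifndef AT_HWCAP2
#define AT_HWCAP2    26            // Linux auxv tag for the second hwcap word
#endif
#ifndef HWCAP2_I8MM
#define HWCAP2_I8MM  (1UL << 13)   // value from the arm64 Linux UAPI headers
#endif

int main() {
    unsigned long hwcap2 = getauxval(AT_HWCAP2);
    bool has_i8mm = (hwcap2 & HWCAP2_I8MM) != 0;  // int8 matmul extension
    printf("i8mm supported: %s\n", has_i8mm ? "yes" : "no");
    return 0;
}
```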
Folks, can you give #12995 a shot on your setups? Recent Windows on ARM64 builds started parking (offlining) CPU cores more aggressively. Instead of changing OS/BIOS settings we...
> Thanks. It doesn't work for me as-is because the `SetThreadInformation` call isn't executed - I assume that's just for the thread pool?

Yep. Threadpool and OMP. I keep forgetting...
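For reference, a minimal sketch of the kind of call being discussed, assuming a recent Windows SDK; the setup below is illustrative, not the exact code from #12995:

```cpp
// Opt a worker thread out of power throttling (EcoQoS) so the scheduler is less
// inclined to park it on efficiency cores. Illustrative sketch only.
#include <windows.h>
#include <cstdio>

static bool disable_power_throttling_for_current_thread() {
    THREAD_POWER_THROTTLING_STATE state;
    ZeroMemory(&state, sizeof(state));
    state.Version     = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask   = 0;  // 0 = throttling explicitly disabled for this thread

    return SetThreadInformation(GetCurrentThread(), ThreadPowerThrottling,
                                &state, sizeof(state)) != 0;
}

int main() {
    printf("power throttling disabled: %s\n",
           disable_power_throttling_for_current_thread() ? "yes" : "no");
    return 0;
}
```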
@Alcpz This PR causes a significant performance regression for prompt processing because it creates a lot more chunks than before. Here is llama3.2-1B-Q4_0 running with 6 threads using instrumented matmul code...
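To illustrate the effect (hypothetical instrumentation, not the actual ggml matmul code): shrinking the chunk size multiplies the number of chunks, and each extra chunk adds scheduling and synchronization overhead:

```cpp
// Toy chunk counter: compare how many chunks a row range is split into.
#include <atomic>
#include <algorithm>
#include <cstdio>

static std::atomic<long> g_chunks{0};

// toy worker: rows are handed out in fixed-size chunks
static void mat_rows_worker(int nrows, int chunk_size) {
    for (int r0 = 0; r0 < nrows; r0 += chunk_size) {
        g_chunks.fetch_add(1, std::memory_order_relaxed);
        int r1 = std::min(r0 + chunk_size, nrows);
        (void) r1;  // real code would compute rows [r0, r1) here
    }
}

int main() {
    mat_rows_worker(4096, 64);   // coarse chunks -> 64 chunks
    long coarse = g_chunks.exchange(0);
    mat_rows_worker(4096, 4);    // fine chunks   -> 1024 chunks, more overhead
    long fine = g_chunks.load();
    printf("coarse=%ld fine=%ld\n", coarse, fine);
    return 0;
}
```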
> @max-krasnyansky I am using the graph-profiler branch but I'm unsure how to trigger and get the profiling details. Any docs, commands, or references would be appreciated. Thanks.

Sorry for...
> I think a good approach could be to add, for each `ggml_tensor`, another field besides `op` which records where this op is coming from, so that we can differentiate...
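A purely hypothetical sketch of that idea; none of the names below exist in ggml, they only show what a per-tensor origin tag could look like:

```cpp
// Hypothetical per-tensor tag, set where the graph is built, recording which part
// of the model produced the op so a profiler could group results by origin.
#include <cstdio>

enum op_origin {
    ORIGIN_UNKNOWN = 0,
    ORIGIN_ATTENTION,
    ORIGIN_FFN,
    ORIGIN_OUTPUT,
};

struct tensor_meta {
    int       op;      // stand-in for ggml_tensor::op
    op_origin origin;  // hypothetical extra field: where this op came from
};

int main() {
    tensor_meta t = { /*op=*/1, ORIGIN_ATTENTION };
    printf("op=%d origin=%d\n", t.op, (int) t.origin);
    return 0;
}
```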
@fmz @slaren I fixed one of the issues that were causing regressions. We were setting the default number of threads in the threadpool using `std::thread::hardware_concurrency()`. I updated that to use...
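For background, `std::thread::hardware_concurrency()` reports logical CPUs (SMT siblings and efficiency cores included), which can oversubscribe compute-bound workers; the snippet below only shows the value being replaced, not what the default was changed to:

```cpp
// Print the logical CPU count that was being used as the threadpool default.
#include <thread>
#include <cstdio>

int main() {
    unsigned logical = std::thread::hardware_concurrency();  // may be 0 if unknown
    printf("logical CPUs reported: %u\n", logical);
    return 0;
}
```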
@fmz @slaren llama-bench has been updated as I described above. Here are the numbers from an M2 Max. I'll share numbers for an AMD EPYC server, Snapdragon X-Elite and Gen-3 a...
> The performance looks better now; with nkvo it is comparable to OpenMP, which is very good. There is still a performance drop when using the BLAS backend (this includes...
@slaren @fmz I managed to further improve the threadpool signaling (reducing the number of wake-ups, etc.) and also introduced a hybrid polling mode, which is now the default. `--poll` now...
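A sketch of the hybrid-polling idea, assuming a generic condition-variable worker (this is not the ggml threadpool code): spin briefly for low-latency wake-ups, then fall back to a blocking wait so idle workers stop burning CPU:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

struct work_signal {
    std::mutex              m;
    std::condition_variable cv;
    std::atomic<bool>       ready{false};
};

void wait_for_work(work_signal & s, std::chrono::microseconds spin_budget) {
    const auto start = std::chrono::steady_clock::now();
    // polling phase: cheap atomic loads, no syscalls, lowest wake-up latency
    while (!s.ready.load(std::memory_order_acquire)) {
        if (std::chrono::steady_clock::now() - start >= spin_budget) {
            // blocking phase: sleep on the condition variable until signalled
            std::unique_lock<std::mutex> lock(s.m);
            s.cv.wait(lock, [&] { return s.ready.load(std::memory_order_acquire); });
            return;
        }
    }
}

void submit_work(work_signal & s) {
    {
        std::lock_guard<std::mutex> lock(s.m);
        s.ready.store(true, std::memory_order_release);
    }
    s.cv.notify_all();  // wakes blocked workers; spinning workers see the flag directly
}

int main() {
    work_signal s;
    std::thread worker([&] { wait_for_work(s, std::chrono::microseconds(200)); });
    submit_work(s);
    worker.join();
    return 0;
}
```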