Akarshan Biswas
@easyfab Possible, since before my revert the backend did not use mmvq at all. The crash happens during model warmup, after the model has been loaded into memory, while...
It should be in ~/.cache.
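For reference, a quick way to check, assuming the default cache location used by the llama.cpp tools (the `llama.cpp` directory name and the `LLAMA_CACHE` override are assumptions based on current builds; adjust for your setup):

```
# List anything the llama.cpp tools have downloaded into the default cache
ls -lh ~/.cache/llama.cpp/

# If LLAMA_CACHE is set, that directory is used instead of the default
echo "${LLAMA_CACHE:-$HOME/.cache/llama.cpp}"
```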
@semidark You can try testing with the patch and with warmup enabled to see if it still crashes... The patch restores the original behavior from before the commit you mentioned, but...
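If it helps, a minimal sketch of the two runs I mean, assuming llama-cli and its `--no-warmup` flag (model path, prompt, and token count are placeholders):

```
# Run with warmup enabled (the default) - this is where the crash was reported
./llama-cli -m /path/to/model.gguf -ngl 99 -p "Hello" -n 32

# Same run with the warmup pass skipped, to confirm warmup is the trigger
./llama-cli -m /path/to/model.gguf -ngl 99 -p "Hello" -n 32 --no-warmup
```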
@semidark Run llama-bench like this comment:

> And llama-bench is ok:
>
> ```
> llama-bench.exe -m E:\models\Meta-Llama-3.1-8B-Instruct-Q4_K_M\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 99
> ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
> ggml_sycl_init: SYCL_USE_XMX: yes
> ...
> ```
@semidark Please see the last column of the table in both cases. pp is prompt processing and tg is text generation. For example: pp of 86.50 ± 11.47 tokens/sec for the patched version...
I think "Total freeze" while loading model is probably related to a display driver problem. I also have an Arc GPU and I do not have this problem. This particular...
Run `./test-backend-ops -b SYCL0` and paste the output here.
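In case the binary location is unclear, here is a rough sketch of how I would build and run it with a SYCL-enabled build (the oneAPI setvars path and compiler names are assumptions; adjust to your environment):

```
# Set up the oneAPI environment and configure a SYCL build of llama.cpp
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

# Run the backend op tests against the first SYCL device only
./build/bin/test-backend-ops -b SYCL0
```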
@NineMeowICT Seems like https://github.com/ggerganov/llama.cpp/issues/9612#issuecomment-2405473195
Please note: there is no need to disable prompt caching. The culprit here is flash attention. If it is not supported by the backend, the attention layers are offloaded to the CPU for prompt...
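As a quick way to see the effect, a sketch comparing flash attention off vs. on with llama-bench (the `-fa` flag and the model path here are assumptions based on current builds):

```
# Flash attention disabled: attention runs with the backend's regular kernels
./llama-bench -m /path/to/model.gguf -ngl 99 -fa 0

# Flash attention enabled: if the backend has no FA kernels, those ops fall back
# to the CPU, which usually shows up as much lower pp (prompt processing) numbers
./llama-bench -m /path/to/model.gguf -ngl 99 -fa 1
```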
> ggml_metal_library_compile_pipeline: error: failed to compile pipeline: base = 'kernel_mul_mv_bf16_f32_4', name = 'kernel_mul_mv_bf16_f32_4_nsg=4'

This sounds like a llama.cpp problem: the Metal kernel failed to compile.