Diego Devesa
The main issue is that it cannot be assumed that all the graphs that `ggml_backend_cuda_graph_compute` receives are the same. Even within `llama.cpp`, the graphs will vary depending on the parameters...
With a debug build: [ 394/ 483] blk.18.ffn_gate_exps.weight - [ 4096, 14336, 8, 1], type = f32, converting to iq2_xxs .. quantize: ggml-quants.c:1313: nearest_int: Assertion `fval
This is good, this was the intended behavior when the host malloc fails, and not clearing the error was an oversight.
It is hard to diagnose the issue with so little information. The mlock error is not relevant to GPU acceleration, and it is not clear why you are using this...
Does this still happen with the current version? It should have been fixed in the latest ggml sync.
This shouldn't happen unless the same `ggml_backend` instance is being used in multiple threads simultaneously, which I believe also implies that the same `whisper_context` is being used in multiple...
As you suggested, disabling the vmm allocator should fix the assert, but the CUDA backend has so many unsynchronized globals that I can only imagine...
The CUDA backend should be thread safe now.
My long-term goal to address this is to move the backends to dynamic libraries loadable at run time; then we could use a single build for all...
I also get `nan` ppl with that model with CUDA, so it does not seem specific to ROCm.