Diego Devesa
The main issue is that it cannot be assumed that all the graphs that `ggml_backend_cuda_graph_compute` receives are the same. Even within `llama.cpp`, the graphs will vary depending on the parameters...
With a debug build: [ 394/ 483] blk.18.ffn_gate_exps.weight - [ 4096, 14336, 8, 1], type = f32, converting to iq2_xxs .. quantize: ggml-quants.c:1313: nearest_int: Assertion `fval
This is good, this was the intended behavior when the host malloc fails, and not clearing the error was an oversight.
It is hard to diagnose the issue with so little information. The mlock error is not relevant to GPU acceleration, and it is not clear why you are using this...
Does this still happen with the current version? It should have been fixed in the latest ggml sync.
This shouldn't happen unless the same `ggml_backend` instance is being used in multiple threads simultaneously, which I believe also implies that the same `whisper_context` is being used in multiple...
As you suggested, disabling the vmm allocator should fix the assert, but the CUDA backend has so many unsynchronized globals that I can only imagine...
The CUDA backend should be thread safe now.
My long-term goal to address this is to move the backends to dynamic libraries loadable at run time; then we could use a single build for all...
I also get `nan` ppl with that model with CUDA, so it does not seem specific to ROCm.