Diego Devesa
I have opened #5653, but this requires changes in the backends and it is not a priority at the moment.
The correct way to fix it would be:
- Move the automatic offloading logic from `ggml.c` to `ggml_backend_sched`
- Make the pool private to the `ggml_backend` instance
- Free the pool...
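A minimal sketch of the pool change, with hypothetical names (`ggml_backend_cuda_context` and `cuda_pool_block` here are illustrative, not the real ggml-cuda internals): the temporary buffer pool becomes a member of the per-backend context instead of a global, so freeing the backend instance also frees its pool.

```cpp
// Hypothetical sketch only: the struct and field names are illustrative.
// The point is that the pool is owned by the backend instance rather than
// living in a global, so it is released when the backend is freed.
#include <cstdlib>
#include <vector>

struct cuda_pool_block {
    void * ptr  = nullptr;
    size_t size = 0;
};

struct ggml_backend_cuda_context {
    int device = 0;
    std::vector<cuda_pool_block> pool; // private, per-instance pool
};

static void ggml_backend_cuda_free(ggml_backend_cuda_context * ctx) {
    for (auto & block : ctx->pool) {
        std::free(block.ptr); // the real backend would use cudaFree here
    }
    delete ctx;
}
```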
My suggestion would be to treat each MPI client as a different device in the same way they are treated in the CUDA and Vulkan backends, and allow the code...
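A rough illustration of that per-device pattern, under stated assumptions: `ggml_backend_mpi_init` does not exist in llama.cpp and is only sketched here by analogy with the real `ggml_backend_cuda_init(int device)`, so that each MPI rank is exposed as its own `ggml_backend_t`.

```cpp
// Assumed API: ggml_backend_mpi_init() is hypothetical. It mirrors the per-device
// pattern of the real ggml_backend_cuda_init(int device), exposing each MPI rank
// as its own ggml_backend_t so the scheduler can assign work to ranks the same
// way it assigns work to individual GPUs.
#include "ggml-backend.h"
#include <vector>

ggml_backend_t ggml_backend_mpi_init(int rank); // hypothetical: one backend per remote rank

static void init_mpi_backends(int n_ranks, std::vector<ggml_backend_t> & backends) {
    for (int rank = 1; rank < n_ranks; ++rank) { // rank 0 stays the local host
        backends.push_back(ggml_backend_mpi_init(rank));
    }
}
```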
You can definitely wrap another backend within a backend, I think that would work. Then the job of the MPI backend would be mainly to serialize the data and procedure...
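A sketch of what the wrapping could look like; `mpi_backend_ctx` and the replay helpers are hypothetical, while the forwarded calls are the real ggml-backend API. The sending rank serializes each call; the remote rank deserializes it and replays it on the wrapped (inner) backend, e.g. CUDA or CPU.

```cpp
// Hypothetical sketch: mpi_backend_ctx and the replay helpers are illustrative,
// not llama.cpp code. Only the ggml_backend_* calls being forwarded are real.
#include "ggml-backend.h"

struct mpi_backend_ctx {
    ggml_backend_t inner; // the wrapped backend that does the actual work
};

// remote side: replay a deserialized set_tensor request
static void mpi_replay_set_tensor(mpi_backend_ctx * ctx, ggml_tensor * tensor,
                                  const void * data, size_t offset, size_t size) {
    (void) ctx; // the tensor already carries its (inner-backend) buffer
    ggml_backend_tensor_set(tensor, data, offset, size);
}

// remote side: replay a deserialized graph_compute request on the inner backend
static void mpi_replay_graph_compute(mpi_backend_ctx * ctx, ggml_cgraph * graph) {
    ggml_backend_graph_compute(ctx->inner, graph);
}
```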
llama.cpp, and more specifically the CUDA backend, is single-threaded. While waiting for user input, there is no other code running, there is no work being submitted to the GPU....
It really isn't. I suggest that you take a look at AMD's debugging tools to try to understand what the GPU is doing while it should be idle, assuming that...
What I meant is that I will work on this after the pipeline parallelism is merged, which is what I am doing. It will still take a while to complete,...
https://github.com/ggerganov/llama.cpp/pull/6170 should fix this issue in the CUDA backend.
@martindevans It should be fixed; please report any issues with thread safety. For example, using multiple llama contexts simultaneously, each with a different CUDA GPU on different threads, should now...
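A minimal sketch of that use case: one llama context per thread, each model pinned to its own CUDA GPU. The model path and layer count are placeholders, and exact field/enum names (`split_mode`, `main_gpu`, `LLAMA_SPLIT_MODE_NONE`) may differ across llama.cpp versions.

```cpp
// Sketch only: "model.gguf" is a placeholder path; field/enum names may differ
// between llama.cpp versions. Each thread builds its own model + context on a
// different GPU and runs independently.
#include "llama.h"
#include <thread>
#include <vector>

static void run_on_gpu(int gpu, const char * model_path) {
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 999;                   // offload all layers
    mparams.split_mode   = LLAMA_SPLIT_MODE_NONE; // keep the whole model on one GPU...
    mparams.main_gpu     = gpu;                   // ...this one

    llama_model * model = llama_load_model_from_file(model_path, mparams);
    if (model == nullptr) {
        return;
    }
    llama_context * ctx = llama_new_context_with_model(model, llama_context_default_params());

    // ... run decoding/generation with ctx on this thread ...

    llama_free(ctx);
    llama_free_model(model);
}

int main() {
    llama_backend_init();

    std::vector<std::thread> workers;
    for (int gpu = 0; gpu < 2; ++gpu) {
        workers.emplace_back(run_on_gpu, gpu, "model.gguf");
    }
    for (auto & t : workers) {
        t.join();
    }

    llama_backend_free();
}
```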
It should also be thread-safe, but I don't expect that to be a very useful use case.