Diego Devesa
I have opened #5653, but this requires changes in the backends and it is not a priority at the moment.
The correct way to fix it would be:
- Move the automatic offloading logic from `ggml.c` to `ggml_backend_sched`
- Make the pool private to the `ggml_backend` instance
- Free the pool...
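A minimal sketch of the pool change, with hypothetical names (`ggml_backend_cuda_context` and `cuda_pool_block` here are illustrative, not the real ggml-cuda internals): the temporary buffer pool becomes a member of the per-backend context instead of a global, so freeing the backend instance also frees its pool.

```cpp
// Hypothetical sketch only: the struct and field names are illustrative.
// The point is that the pool is owned by the backend instance rather than
// living in a global, so it is released when the backend is freed.
#include <cstdlib>
#include <vector>

struct cuda_pool_block {
    void * ptr  = nullptr;
    size_t size = 0;
};

struct ggml_backend_cuda_context {
    int device = 0;
    std::vector<cuda_pool_block> pool; // private, per-instance pool
};

static void ggml_backend_cuda_free(ggml_backend_cuda_context * ctx) {
    for (auto & block : ctx->pool) {
        std::free(block.ptr); // the real backend would use cudaFree here
    }
    delete ctx;
}
```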
My suggestion would be to treat each MPI client as a different device in the same way they are treated in the CUDA and Vulkan backends, and allow the code...
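A rough illustration of that per-device pattern, under stated assumptions: `ggml_backend_mpi_init` does not exist in llama.cpp and is only sketched here by analogy with the real `ggml_backend_cuda_init(int device)`, so that each MPI rank is exposed as its own `ggml_backend_t`.

```cpp
// Assumed API: ggml_backend_mpi_init() is hypothetical. It mirrors the per-device
// pattern of the real ggml_backend_cuda_init(int device), exposing each MPI rank
// as its own ggml_backend_t so the scheduler can assign work to ranks the same
// way it assigns work to individual GPUs.
#include "ggml-backend.h"
#include <vector>

ggml_backend_t ggml_backend_mpi_init(int rank); // hypothetical: one backend per remote rank

static void init_mpi_backends(int n_ranks, std::vector<ggml_backend_t> & backends) {
    for (int rank = 1; rank < n_ranks; ++rank) { // rank 0 stays the local host
        backends.push_back(ggml_backend_mpi_init(rank));
    }
}
```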
You can definitely wrap another backend within a backend, I think that would work. Then the job of the MPI backend would be mainly to serialize the data and procedure...
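A sketch of what the wrapping could look like; `mpi_backend_ctx` and the replay helpers are hypothetical, while the forwarded calls are the real ggml-backend API. The sending rank serializes each call; the remote rank deserializes it and replays it on the wrapped (inner) backend, e.g. CUDA or CPU.

```cpp
// Hypothetical sketch: mpi_backend_ctx and the replay helpers are illustrative,
// not llama.cpp code. Only the ggml_backend_* calls being forwarded are real.
#include "ggml-backend.h"

struct mpi_backend_ctx {
    ggml_backend_t inner; // the wrapped backend that does the actual work
};

// remote side: replay a deserialized set_tensor request
static void mpi_replay_set_tensor(mpi_backend_ctx * ctx, ggml_tensor * tensor,
                                  const void * data, size_t offset, size_t size) {
    (void) ctx; // the tensor already carries its (inner-backend) buffer
    ggml_backend_tensor_set(tensor, data, offset, size);
}

// remote side: replay a deserialized graph_compute request on the inner backend
static void mpi_replay_graph_compute(mpi_backend_ctx * ctx, ggml_cgraph * graph) {
    ggml_backend_graph_compute(ctx->inner, graph);
}
```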
llama.cpp, and more specifically the CUDA backend, is single-threaded. While waiting for user input, there is no other code running, there is no work being submitted to the GPU....
It really isn't. I suggest that you take a look at AMD's debugging tools to try to understand what the GPU is doing while it should be idle, assuming that...
What I meant is that I will work on this after the pipeline parallelism is merged, which is what I am doing. It will still take a while to complete,...
https://github.com/ggerganov/llama.cpp/pull/6170 should fix this issue in the CUDA backend.
@martindevans It should be fixed; please report any issues with thread safety. For example, using multiple llama contexts simultaneously, each with a different CUDA GPU on different threads, should now...
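A minimal sketch of that use case: one llama context per thread, each model pinned to its own CUDA GPU. The model path and layer count are placeholders, and exact field/enum names (`split_mode`, `main_gpu`, `LLAMA_SPLIT_MODE_NONE`) may differ across llama.cpp versions.

```cpp
// Sketch only: "model.gguf" is a placeholder path; field/enum names may differ
// between llama.cpp versions. Each thread builds its own model + context on a
// different GPU and runs independently.
#include "llama.h"
#include <thread>
#include <vector>

static void run_on_gpu(int gpu, const char * model_path) {
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 999;                   // offload all layers
    mparams.split_mode   = LLAMA_SPLIT_MODE_NONE; // keep the whole model on one GPU...
    mparams.main_gpu     = gpu;                   // ...this one

    llama_model * model = llama_load_model_from_file(model_path, mparams);
    if (model == nullptr) {
        return;
    }
    llama_context * ctx = llama_new_context_with_model(model, llama_context_default_params());

    // ... run decoding/generation with ctx on this thread ...

    llama_free(ctx);
    llama_free_model(model);
}

int main() {
    llama_backend_init();

    std::vector<std::thread> workers;
    for (int gpu = 0; gpu < 2; ++gpu) {
        workers.emplace_back(run_on_gpu, gpu, "model.gguf");
    }
    for (auto & t : workers) {
        t.join();
    }

    llama_backend_free();
}
```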
It should also be thread-safe, but I don't expect that to be a very useful use case.