
ggml : become thread-safe

Open ggerganov opened this issue 2 years ago • 7 comments

ref https://github.com/ggerganov/llama.cpp/discussions/499#discussioncomment-7478602

We should be able to run inference on multiple graphs, backends and devices in parallel. Currently, there are CUDA singletons that break this requirement and possibly there could be other problems.
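For reference, the kind of usage this issue asks to make safe looks roughly like the sketch below: each thread owns its own ggml context, graph and CUDA backend, and computes independently with no shared global state. This is only an illustration, not code from the issue; the headers (ggml-alloc.h, ggml-backend.h, ggml-cuda.h), the helper name run_on_device, and the assumption of two CUDA devices are mine.

```cpp
// Sketch: one independent graph + backend per thread (assumes 2 CUDA devices).
#include <thread>
#include <vector>
#include <cstdio>

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

static void run_on_device(int device) {
    // one CUDA backend per thread, bound to its own device
    ggml_backend_t backend = ggml_backend_cuda_init(device);
    if (!backend) { fprintf(stderr, "failed to init CUDA device %d\n", device); return; }

    // context holding only tensor/graph metadata; data lives in backend buffers
    struct ggml_init_params ip = {
        /*.mem_size   =*/ ggml_tensor_overhead()*16 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    ggml_set_input(a);
    ggml_set_input(b);
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);
    ggml_set_output(c);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // allocate all tensors of this graph on this thread's backend
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    std::vector<float> data(64*64, 1.0f);
    ggml_backend_tensor_set(a, data.data(), 0, ggml_nbytes(a));
    ggml_backend_tensor_set(b, data.data(), 0, ggml_nbytes(b));

    // this call is what must be safe to issue concurrently from several threads
    ggml_backend_graph_compute(backend, gf);

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
    ggml_backend_free(backend);
}

int main() {
    std::thread t0(run_on_device, 0);
    std::thread t1(run_on_device, 1); // assumes a second CUDA device is present
    t0.join();
    t1.join();
    return 0;
}
```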

ggerganov avatar Nov 05 '23 15:11 ggerganov

Any updates?

pseudotensor avatar Feb 27 '24 22:02 pseudotensor

FYI, I noticed that while recent llama.cpp brings speed improvements, the thread safety has gotten worse. When I run a TTS model and a GGUF model in two threads, everything now crashes with the latest llama.cpp, where it did not before.

I get:

CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at /tmp/pip-install-r9_ew1og/llama-cpp-python_858d7004c05a49ac876ecf41f562f687/vendor/llama.cpp/ggml-cuda.cu:11578
  cudaDeviceSynchronize()
GGML_ASSERT: /tmp/pip-install-r9_ew1og/llama-cpp-python_858d7004c05a49ac876ecf41f562f687/vendor/llama.cpp/ggml-cuda.cu:255: !"CUDA error"
!!!! kernel execution error. (m: 3072, n: 77, k: 1024, error: 13) 

Worked perfectly with heavy use before.

pseudotensor avatar Mar 06 '24 20:03 pseudotensor

I am a bit confused. I thought that @slaren solved this problem with 'llama : add pipeline parallelism support' (https://github.com/ggerganov/llama.cpp/pull/6017)? Or do you mean something else here?

@slaren said that when he was done with 6017 he would fix the backend to release all CUDA memory, which is a big problem for many of us. I am not impatient, I just would like to understand what is happening.

zsogitbe avatar Mar 14 '24 13:03 zsogitbe

What I meant is that I will work on this after the pipeline parallelism is merged, which is what I am doing. It will still take a while to complete, as fixing this will require infrastructure changes in other parts of the code. Sorry for the confusion.

slaren avatar Mar 14 '24 13:03 slaren

I understand. Thank you for caring about this issue and for working on it! I tried to solve it myself, but I could not.

zsogitbe avatar Mar 14 '24 13:03 zsogitbe

#5396

DEVDEVIL007 avatar Mar 19 '24 06:03 DEVDEVIL007

https://github.com/ggerganov/llama.cpp/pull/6170 should fix this issue in the CUDA backend.

slaren avatar Mar 20 '24 10:03 slaren

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar May 05 '24 01:05 github-actions[bot]

Is this actually fixed?

martindevans avatar May 05 '24 01:05 martindevans

Some additional problems have come up in https://github.com/ggerganov/llama.cpp/issues/6909, maybe because of this fix?

zsogitbe avatar May 05 '24 04:05 zsogitbe

@martindevans It should be fixed; please report any issues with thread safety. For example, using multiple llama contexts simultaneously, each with a different CUDA GPU, on different threads should now be possible. The CPU and Metal backends should also be thread-safe; other backends probably are not.
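As a rough sketch of that usage pattern, the following uses the llama.h C API (llama_load_model_from_file, llama_new_context_with_model, etc.) to pin one context per thread to its own GPU. The model path, the two-GPU assumption, and loading a separate copy of the model per thread are simplifications of mine, not something stated in this thread.

```cpp
// Sketch: one llama_context per thread, each bound to a different CUDA device.
#include <thread>
#include <cstdio>

#include "llama.h"

static void run_context_on_gpu(const char * model_path, int gpu) {
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                     // offload all layers
    mparams.split_mode   = LLAMA_SPLIT_MODE_NONE;  // keep the model on a single device
    mparams.main_gpu     = gpu;                    // ... namely this one

    llama_model * model = llama_load_model_from_file(model_path, mparams);
    if (!model) { fprintf(stderr, "failed to load model on GPU %d\n", gpu); return; }

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... run decoding with llama_decode(ctx, batch) here, entirely on this thread ...

    llama_free(ctx);
    llama_free_model(model);
}

int main(int argc, char ** argv) {
    const char * model_path = argc > 1 ? argv[1] : "model.gguf"; // placeholder path

    llama_backend_init(); // global init is still done once, before spawning threads

    std::thread t0(run_context_on_gpu, model_path, 0);
    std::thread t1(run_context_on_gpu, model_path, 1); // assumes a second CUDA device

    t0.join();
    t1.join();

    llama_backend_free();
    return 0;
}
```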

slaren avatar May 08 '24 00:05 slaren

That's great to hear! I'll experiment with removing some of the locks we added into LLamaSharp and will report any bugs. Thanks.

martindevans avatar May 08 '24 00:05 martindevans

What about the same GPU? Why isn't that thread-safe too?

pseudotensor avatar May 08 '24 00:05 pseudotensor

It should also be thread-safe, but I don't expect that to be a very useful use case.

slaren avatar May 08 '24 01:05 slaren