
ggml : become thread-safe

Open ggerganov opened this issue 2 years ago • 7 comments

ref https://github.com/ggerganov/llama.cpp/discussions/499#discussioncomment-7478602

We should be able to run inference on multiple graphs, backends and devices in parallel. Currently, there are CUDA singletons that break this requirement and possibly there could be other problems.
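For reference, the kind of usage this issue asks to make safe looks roughly like the sketch below: each thread owns its own ggml context, graph and CUDA backend, and computes independently with no shared global state. This is only an illustration, not code from the issue; the headers (ggml-alloc.h, ggml-backend.h, ggml-cuda.h), the helper name run_on_device, and the assumption of two CUDA devices are mine.

```cpp
// Sketch: one independent graph + backend per thread (assumes 2 CUDA devices).
#include <thread>
#include <vector>
#include <cstdio>

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

static void run_on_device(int device) {
    // one CUDA backend per thread, bound to its own device
    ggml_backend_t backend = ggml_backend_cuda_init(device);
    if (!backend) { fprintf(stderr, "failed to init CUDA device %d\n", device); return; }

    // context holding only tensor/graph metadata; data lives in backend buffers
    struct ggml_init_params ip = {
        /*.mem_size   =*/ ggml_tensor_overhead()*16 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 64);
    ggml_set_input(a);
    ggml_set_input(b);
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);
    ggml_set_output(c);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // allocate all tensors of this graph on this thread's backend
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    std::vector<float> data(64*64, 1.0f);
    ggml_backend_tensor_set(a, data.data(), 0, ggml_nbytes(a));
    ggml_backend_tensor_set(b, data.data(), 0, ggml_nbytes(b));

    // this call is what must be safe to issue concurrently from several threads
    ggml_backend_graph_compute(backend, gf);

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
    ggml_backend_free(backend);
}

int main() {
    std::thread t0(run_on_device, 0);
    std::thread t1(run_on_device, 1); // assumes a second CUDA device is present
    t0.join();
    t1.join();
    return 0;
}
```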

ggerganov avatar Nov 05 '23 15:11 ggerganov

Any updates?

pseudotensor avatar Feb 27 '24 22:02 pseudotensor

FYI, I noticed that while recent llama.cpp brings speed improvements, the thread safety has gotten worse. When I run a TTS model and a GGUF model in two threads, everything now crashes with the latest llama.cpp, where it did not before.

I get:

CUDA error: an illegal memory access was encountered
  current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at /tmp/pip-install-r9_ew1og/llama-cpp-python_858d7004c05a49ac876ecf41f562f687/vendor/llama.cpp/ggml-cuda.cu:11578
  cudaDeviceSynchronize()
GGML_ASSERT: /tmp/pip-install-r9_ew1og/llama-cpp-python_858d7004c05a49ac876ecf41f562f687/vendor/llama.cpp/ggml-cuda.cu:255: !"CUDA error"
!!!! kernel execution error. (m: 3072, n: 77, k: 1024, error: 13) 

Worked perfectly with heavy use before.

pseudotensor avatar Mar 06 '24 20:03 pseudotensor

I am a bit confused. I thought that @slaren solved this problem with 'llama : add pipeline parallelism support' (https://github.com/ggerganov/llama.cpp/pull/6017)? Or do you mean something else here?

@slaren said that when he was done with 6017 he would fix the backend to release all CUDA memory, which is a big problem for many of us. I am not impatient, I just would like to understand what is happening.

zsogitbe avatar Mar 14 '24 13:03 zsogitbe

What I meant is that I will work on this after the pipeline parallelism is merged, which is what I am doing. It will still take a while to complete, as fixing this will require infrastructure changes in other parts of the code. Sorry for the confusion.

slaren avatar Mar 14 '24 13:03 slaren

I understand. Thank you for caring about this issue and for working on it! I tried to solve it myself, but I could not.

zsogitbe avatar Mar 14 '24 13:03 zsogitbe

#5396

DEVDEVIL007 avatar Mar 19 '24 06:03 DEVDEVIL007

https://github.com/ggerganov/llama.cpp/pull/6170 should fix this issue in the CUDA backend.

slaren avatar Mar 20 '24 10:03 slaren

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar May 05 '24 01:05 github-actions[bot]

Is this actually fixed?

martindevans avatar May 05 '24 01:05 martindevans

Some additional problems have come up in https://github.com/ggerganov/llama.cpp/issues/6909, maybe because of this fix?

zsogitbe avatar May 05 '24 04:05 zsogitbe

@martindevans It should be fixed; please report any issues with thread safety. For example, using multiple llama contexts simultaneously, each with a different CUDA GPU, on different threads should now be possible. The CPU and Metal backends should also be thread-safe; other backends probably are not.
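As a rough sketch of that usage pattern, the following uses the llama.h C API (llama_load_model_from_file, llama_new_context_with_model, etc.) to pin one context per thread to its own GPU. The model path, the two-GPU assumption, and loading a separate copy of the model per thread are simplifications of mine, not something stated in this thread.

```cpp
// Sketch: one llama_context per thread, each bound to a different CUDA device.
#include <thread>
#include <cstdio>

#include "llama.h"

static void run_context_on_gpu(const char * model_path, int gpu) {
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                     // offload all layers
    mparams.split_mode   = LLAMA_SPLIT_MODE_NONE;  // keep the model on a single device
    mparams.main_gpu     = gpu;                    // ... namely this one

    llama_model * model = llama_load_model_from_file(model_path, mparams);
    if (!model) { fprintf(stderr, "failed to load model on GPU %d\n", gpu); return; }

    llama_context_params cparams = llama_context_default_params();
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... run decoding with llama_decode(ctx, batch) here, entirely on this thread ...

    llama_free(ctx);
    llama_free_model(model);
}

int main(int argc, char ** argv) {
    const char * model_path = argc > 1 ? argv[1] : "model.gguf"; // placeholder path

    llama_backend_init(); // global init is still done once, before spawning threads

    std::thread t0(run_context_on_gpu, model_path, 0);
    std::thread t1(run_context_on_gpu, model_path, 1); // assumes a second CUDA device

    t0.join();
    t1.join();

    llama_backend_free();
    return 0;
}
```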

slaren avatar May 08 '24 00:05 slaren

That's great to hear! I'll experiment with removing some of the locks we added into LLamaSharp and will report any bugs. Thanks.

martindevans avatar May 08 '24 00:05 martindevans

What about the same GPU? Why isn't that thread-safe too?

pseudotensor avatar May 08 '24 00:05 pseudotensor

It should also be thread-safe, but I don't expect that to be a very useful use case.

slaren avatar May 08 '24 01:05 slaren