llama.cpp
ggml : become thread-safe
ref https://github.com/ggerganov/llama.cpp/discussions/499#discussioncomment-7478602
We should be able to run inference on multiple graphs, backends, and devices in parallel. Currently there are CUDA singletons that break this requirement, and there may be other shared global state with the same problem.
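To illustrate the goal, here is a minimal, hypothetical C++ sketch (the names below are illustrative, not ggml's actual API): a process-wide singleton holding backend/device state forces every caller through one piece of shared mutable state, while moving that state into a per-graph compute context lets independent graphs run on different devices from different threads without interfering.

```cpp
// Hypothetical sketch of the singleton problem and the per-context fix.
// These types are illustrative only; they are not ggml's real API.

#include <cstdio>
#include <thread>
#include <vector>

// Anti-pattern: one global "current device" shared by every caller.
// Two threads computing on different devices race on this state.
struct GlobalBackendState {
    int current_device = 0;   // last device selected, shared by all threads
} g_backend;                  // process-wide singleton

// Preferred pattern: each compute context owns its backend state,
// so concurrent contexts never touch shared mutable globals.
struct ComputeContext {
    int device;               // device this context is bound to
    explicit ComputeContext(int dev) : device(dev) {}

    void run_graph(const char *name) const {
        // Device selection, streams, scratch buffers, etc. would all
        // live inside the context instead of in g_backend.
        std::printf("graph %s running on device %d\n", name, device);
    }
};

int main() {
    // Two independent graphs on two devices, computed in parallel.
    ComputeContext ctx0(0), ctx1(1);

    std::vector<std::thread> workers;
    workers.emplace_back([&] { ctx0.run_graph("graph-A"); });
    workers.emplace_back([&] { ctx1.run_graph("graph-B"); });
    for (auto &t : workers) t.join();

    return 0;
}
```

The design point is the same one the issue asks for: anything a backend needs at compute time should hang off the graph/backend/context objects passed in by the caller rather than off globals, so that multiple graphs, backends, and devices can be used concurrently.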