
Misc. bug: OpenCL context reference counting is wrong in llama-bench

Open robquill opened this issue 2 months ago

Name and Version

version: 6024 (834c0ea2) built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-bench

Command line

./llama-bench -m /home/rob/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p 32,64 -ngl 33 -r 0

Problem description & steps to reproduce

The crux of the problem is that an OpenCL context is created by ggml_opencl_probe_devices, which runs when ggml_backend_load_all is called at llama-bench.cpp:1838.

However, that context is released by ggml_cl2_free, which runs when llama_free is called at llama-bench.cpp:2189, inside the benchmark for-loop. This means that on the next iteration of the loop there is no valid OpenCL context, and the driver returns errors.

I'm not sure whether the problem is where the OpenCL backend releases the context, whether llama-bench shouldn't be calling llama_free inside the loop, or whether llama-bench should take another reference to the backend context inside the loop.
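As a rough sketch of the last option (this is only my own illustration, not the actual ggml-opencl code, and all helper names below are hypothetical), each consumer of the shared cl_context could take its own reference, so that a per-instance free only drops the reference it took and the probe-time context survives the benchmark loop:

#include <CL/cl.h>

struct opencl_device_ctx {
    cl_context context;   // shared context created once during device probing
};

// Created once, analogous to what ggml_opencl_probe_devices does today.
static cl_context create_shared_context(cl_platform_id platform, cl_device_id device) {
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties) platform, 0
    };
    cl_int err = CL_SUCCESS;
    return clCreateContext(props, 1, &device, nullptr, nullptr, &err); // refcount == 1
}

// Hypothetical per-instance init: take an extra reference on the shared context.
static void backend_instance_init(opencl_device_ctx * dev, cl_context shared) {
    clRetainContext(shared);        // refcount += 1
    dev->context = shared;
}

// Hypothetical per-instance free (the role ggml_cl2_free plays today):
// release only the reference this instance took, so the shared context stays alive.
static void backend_instance_free(opencl_device_ctx * dev) {
    clReleaseContext(dev->context); // refcount -= 1
    dev->context = nullptr;
}

With something along these lines the retain/release counts stay balanced no matter how many times llama-bench creates and frees a llama context inside its loop.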

Any advice on the best way to fix the issue would be much appreciated.

First Bad Commit

No response

Relevant log output


robquill avatar Oct 16 '25 10:10 robquill

@max-krasnyansky , @lhez as people who seem to know about the OpenCL backend, do you have an opinion on the right way to fix this?

robquill avatar Nov 14 '25 14:11 robquill

Oh. Missed this one earlier. Will try to reproduce locally and report back.

max-krasnyansky avatar Nov 16 '25 02:11 max-krasnyansky

@robquill Could you share the error message you got as well as your environment (which GPU, driver version)?

lhez avatar Nov 16 '25 04:11 lhez

I originally hit this on a PowerVR GPU, where the extra releases cause a real failure. I just tested on my Intel GPU and llama-bench runs correctly there. However, I can still demonstrate the problem using the OpenCL Intercept Layer (https://github.com/intel/opencl-intercept-layer).

export CLI_LeakChecking=1
~/git/opencl-intercept-layer/build/cliloader/cliloader ./bin/llama-bench -m /home/rob/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p 32,64 -ngl 33 -r 0

gives:

Leak Checking:

Unexpected counts for type cl_context!
    Number of Allocations: 1
    Number of Retains:     0
    Number of Releases:    3
...
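
To make those numbers concrete, here is a minimal standalone program (my own repro sketch, not llama.cpp code) that produces the same kind of report under CLI_LeakChecking=1: one context allocation, zero retains, three releases. The over-releases are deliberate to mirror what the leak check flags; releasing a context more times than it was allocated plus retained is undefined behaviour, and stricter drivers will start returning errors afterwards.

#include <CL/cl.h>

int main() {
    cl_platform_id platform;
    cl_device_id   device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err); // 1 allocation

    clReleaseContext(ctx); // balanced release: refcount reaches 0 here
    clReleaseContext(ctx); // over-release, flagged by the leak checker
    clReleaseContext(ctx); // over-release, flagged by the leak checker

    return 0;
}

On a lenient driver the over-releases appear harmless, which matches what I see on Intel versus PowerVR.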

robquill avatar Nov 17 '25 11:11 robquill

Thank you for the details! I have the OpenCL Intercept Layer set up. I will reproduce this on my end and investigate.

lhez avatar Nov 17 '25 18:11 lhez