Correlation between cpu threads and n-gpu-layers
I'm testing on the target board. Here is the board info:

CPU:
  Architecture:        aarch64
  CPU(s):              4
  On-line CPU(s) list: 0-3
  Vendor ID:           ARM
  Model name:          Cortex-A78
GPU:
  Mali
I am measuring the performance improvement while varying n-gpu-layers. When the number of CPU threads is 1, performance improves as the number of GPU layers increases. However, when I increased the thread count to 4, there was no improvement at all as the number of GPU layers increased, and sometimes performance even decreased. Since this is a case where the CPU and GPU are used simultaneously, my guesses are: (1) data-copy overhead between the CPU and GPU, or (2) synchronization issues when the workload is split between the CPU and GPU.
However, the data-copy overhead between the CPU and GPU should be the same whether there is 1 thread or 4 threads. What do you think about this part? And if you have any other thoughts on this issue, I would be grateful for a reply.
Hi, could you please detail your build command and the CLI parameters you are using?
The behavior differs depending on the GPU backend being used. Since it is a Mali GPU, I assume you are using OpenCL, is that correct?
@slaren: correct.
@phymbert I built/tested llama.cpp with CLBlast. The commands and parameters used are below; in addition to these, I also tested with `main`, `llama-bench`, etc.
```shell
# cpu-threads 1, ngl 0
$ main -m <MODEL> --no-mmap -t 1 -ngl 0
# cpu-threads 1, ngl 20
$ main -m <MODEL> --no-mmap -t 1 -ngl 20
# cpu-threads 4, ngl 0
$ main -m <MODEL> --no-mmap -t 4 -ngl 0
# cpu-threads 4, ngl 20
$ main -m <MODEL> --no-mmap -t 4 -ngl 20
```
With the OpenCL backend, the CPU threads spin-wait while the matrix multiplication runs on the GPU. This is of course very bad for performance and power usage. Other backends do not suffer from this, but they can still see reduced performance when using many threads with partial offloading, due to the overhead of starting the threads. The OpenCL backend is very outdated, and if it is not updated to the newer backend infrastructure, at some point it will probably be removed entirely in favor of the Vulkan backend.
@slaren Thanks for the explanation. I'm still curious how other backends (e.g. cuBLAS for CUDA) don't suffer from this issue. Even with CUDA or Vulkan, if the workload is divided between the CPU and GPU, wouldn't the CPU likewise spin while the GPU performs the matrix multiplication? I'm also curious about the data-copy overhead. The computation results are copied from the GPU into the CPU tensor buffer, and that copy should happen the same number of times regardless of whether there is one thread or multiple threads. If that's true, I guess the copy is not the main cause of the performance degradation as the number of CPU threads grows, is that right?
The OpenCL backend hooks into the CPU backend and takes control of the execution of some ops. Other backends implement the ggml-backend interface, and the CPU backend isn't running at all while they are running. The data is only copied once; the OpenCL code runs on the first thread while the rest of the threads spin.
> The OpenCL backend hooks into the CPU backend and takes control of the execution of some ops. Other backends implement the ggml-backend interface, and the CPU backend isn't running at all while they are running.
First of all, thank you for your comment, but something is a little confusing. My understanding was that `-ngl` specifies the number of layers to offload, so that some of the work is handled by the GPU rather than the CPU. Since not all of the work is offloaded to the GPU, even with the CUDA or Vulkan backends the CPU still works alongside the GPU, and the CPU might wait for the GPU's results. So I don't quite understand the statement that the CPU backend is not running while CUDA or Vulkan is running.
The CPU and GPU backends do not work in parallel; they each process a different part of the model, in sequence. Generally what happens is that the input layer runs on the CPU, then the offloaded layers run on the GPU, and then the rest of the layers run on the CPU. Additionally, with OpenCL only the matrix multiplications and the mul/add ops with offloaded weights run on the GPU.
Thank you for kindly letting me know about the questions I was curious about.
Are the problems mentioned due to limitations of the OpenCL specification itself, or are they a problem with the current ggml-opencl implementation? On a slightly different note, TFLite's GPU delegate has both OpenGL and OpenCL implementations, and the OpenCL one performs quite well. Is there a way to solve the problem while still using OpenCL, as they did? https://blog.tensorflow.org/2020/08/faster-mobile-gpu-inference-with-opencl.html
I don't think there is anything about OpenCL that would prevent creating a better backend implementation, it just hasn't been updated much since it was added to ggml.
This issue was closed because it has been inactive for 14 days since being marked as stale.
> The CPU and GPU backends do not work in parallel, they process each a different part of the model each in sequence. Generally what happens is that the input layer runs on the CPU, then the layers offloaded on the GPU, and then rest of the layers on the CPU. Additionally, with OpenCL only the matrix multiplications and mul/add ops with offloaded weights run on the GPU.
Sorry to bother you. I'm trying to figure out whether the GPU is used at all with `-ngl 0`. Based on what you said, can I assume that all ops are executed on the CPU when ngl is 0?