Is the GPU working?
After running 'ollama run llama3:70b', CPU and GPU utilization both rose to 100% while the model was being loaded into RAM and VRAM, then dropped back to 0%. I then sent a message and the model began to answer. The GPU only spiked to 100% at the very start and immediately fell back to 0%, leaving only the CPU working until the answer finished. Is this normal?
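(For reference, one command-line way to watch this while the model is answering — a minimal sketch, assuming an NVIDIA card with nvidia-smi on the PATH:

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

This samples GPU utilization and VRAM use once per second.)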
@15731807423 what's the output of ollama ps? It should tell you how much of the model is on the GPU and how much is on the CPU.
@pdevine Here is the occupied RAM and VRAM. The GPU utilization stays at 0% the whole time it is answering.
NAME ID SIZE PROCESSOR UNTIL
llama3:70b be39eb53a197 41 GB 42%/58% CPU/GPU 4 minutes from now
NAME ID SIZE PROCESSOR UNTIL
llama3:latest a6990ed6be41 5.4 GB 100% GPU 4 minutes from now
@15731807423 looks like 70b is being partially offloaded, and 8b is running fully on the GPU. When you do /set verbose, how many tokens/second are you getting? With llama3:latest I would expect about 120-125 toks/second with a 4090. 70b will be much, much, much slower because almost half of the model is on the CPU, and because it's a huge model. You should be getting around 2-3 toks/sec, although it will vary depending on your CPU.
Here's my ollama ps output on the 4090:
$ ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3:70b be39eb53a197 41 GB 40%/60% CPU/GPU 4 minutes from now
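(Roughly the same timing stats can also be printed without the REPL — a minimal sketch, assuming the --verbose flag of ollama run and an arbitrary test prompt:

ollama run llama3 --verbose "why is the sky blue?"
ollama run llama3:70b --verbose "why is the sky blue?"

Each run should end with the same prompt eval rate / eval rate lines shown in the transcripts below.)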
@pdevine What I don't understand is that GPU utilization stays at 0%. It only jumps to 100% for an instant at the start, then falls back to 0% a second later, while the CPU keeps working until the answer is complete. Is that expected?
(base) PS C:\Windows\System32> ollama run llama3
/set verbose
Set 'verbose' mode.
你好 😊 你好!我是 Chatbot,很高兴见到你!如果你需要帮助或想聊天,请随时问我。 😊
total duration: 5.8423836s
load duration: 5.4839949s
prompt eval count: 12 token(s)
prompt eval duration: 17.113ms
prompt eval rate: 701.22 tokens/s
eval count: 34 token(s)
eval duration: 334.75ms
eval rate: 101.57 tokens/s
(base) PS C:\Windows\System32> ollama run llama3:70b
/set verbose
Set 'verbose' mode.
你好 😊 Ni Hao! (您好) Welcome! How can I help you today? 🤔
total duration: 13.0373642s
load duration: 6.6727ms
prompt eval count: 12 token(s)
prompt eval duration: 2.312915s
prompt eval rate: 5.19 tokens/s
eval count: 22 token(s)
eval duration: 10.71453s
eval rate: 2.05 tokens/s
I think this might be related to https://github.com/ollama/ollama/issues/1651? It doesn't look like ollama is using the GPU on PopOS.
It is using the GPU, but it's not using it particularly efficiently, because the model is split across the CPU and GPU and because of the limitations of the computer (like slow memory). You can turn the GPU off entirely in the repl with:
>>> /set parameter num_gpu 0
Which should show you the difference in performance. You can also load a lower number of layers (i.e. /set parameter num_gpu 1), which will offload most of the layers in the model to the CPU. I believe the reason the activity monitor shows the GPU not doing much has to do with the bandwidth to the GPU and the contention between system memory and the GPU itself. That said, it's possible that we can eke more speed out of this in the future if we're more clever about how we load the model onto the GPU.
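(The same parameter can also be set per-request outside the REPL, which makes it easier to script an A/B comparison — a minimal sketch against the HTTP API, assuming the default localhost:11434 endpoint:

curl http://localhost:11434/api/generate -d '{"model": "llama3:70b", "prompt": "why is the sky blue?", "options": {"num_gpu": 0}}'

The num_gpu option here is the same layer-offload parameter as in the REPL /set parameter command; 0 keeps everything on the CPU for that request.)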
Back to CPU only (using num_gpu 0) I get:
total duration: 3m25.479681006s
load duration: 4.023984693s
prompt eval count: 208 token(s)
prompt eval duration: 41.733919s
prompt eval rate: 4.98 tokens/s
eval count: 259 token(s)
eval duration: 2m39.571141s
eval rate: 1.62 tokens/s
or roughly half the speed of the GPU.
To expand on what Patrick mentioned, the 42% of the model loaded into system memory and running its inference calculations on the CPU is significantly slower than the GPU, so the GPU quickly finishes its calculations for each step of inference and then sits idle waiting for the CPU to catch up. The closer you can get to 100% on GPU, the better the performance will be. If you have further questions, let us know.
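(As a rough illustration with made-up numbers: if the GPU finishes its 58% share of each token in about 20 ms but the CPU needs about 470 ms for its 42% share, a token takes roughly 490 ms — close to the ~2 toks/sec measured above — and the GPU is busy for only about 4% of each step, which is why a utilization monitor that samples once a second will usually read 0%.)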