
Is the GPU working?

15731807423 opened this issue May 20, 2024 · 6 comments

[Screenshot: 微信截图_20240520112116 (WeChat screen capture)]

After running `ollama run llama3:70b`, CPU and GPU utilization rose to 100% while the model was being loaded into RAM and VRAM, then dropped to 0%. Then I sent a message and the model began to answer. The GPU only spiked to 100% at the very start and immediately dropped back to 0%, leaving only the CPU working. Is this normal?

15731807423 avatar May 20 '24 03:05 15731807423

@15731807423 what's the output of `ollama ps`? It should tell you how much of the model is on the GPU and how much is on the CPU.

pdevine avatar May 20 '24 05:05 pdevine

@pdevine Here is the occupied RAM and VRAM. GPU utilization stays at 0% the whole time it's answering.

NAME            ID              SIZE    PROCESSOR       UNTIL
llama3:70b      be39eb53a197    41 GB   42%/58% CPU/GPU 4 minutes from now

NAME            ID              SIZE    PROCESSOR       UNTIL
llama3:latest   a6990ed6be41    5.4 GB  100% GPU        4 minutes from now

15731807423 avatar May 20 '24 06:05 15731807423

@15731807423 looks like 70b is being partially offloaded, and 8b is running fully on the GPU. When you do `/set verbose`, how many tokens/second are you getting? With llama3:latest I would expect about 120-125 toks/second with a 4090. 70b will be much, much slower, both because almost half of the model is on the CPU and because it's a huge model. You should be getting around 2-3 toks/sec, although it will vary depending on your CPU.

Here's my ollama ps output on the 4090:

$ ollama ps
NAME      	ID          	SIZE 	PROCESSOR      	UNTIL
llama3:70b	be39eb53a197	41 GB	40%/60% CPU/GPU	4 minutes from now

pdevine avatar May 20 '24 07:05 pdevine

@pdevine What I don't understand is that GPU utilization stays at 0%. It spikes to 100% for an instant at the start, then drops back to 0% a second later, while the CPU keeps working until the answer is finished. Is that correct?

(base) PS C:\Windows\System32> ollama run llama3
>>> /set verbose
Set 'verbose' mode.
>>> 你好 😊
你好!我是 Chatbot,很高兴见到你!如果你需要帮助或想聊天,请随时问我。 😊
(Hello! I'm Chatbot, nice to meet you! If you need help or want to chat, feel free to ask me. 😊)

total duration:       5.8423836s
load duration:        5.4839949s
prompt eval count:    12 token(s)
prompt eval duration: 17.113ms
prompt eval rate:     701.22 tokens/s
eval count:           34 token(s)
eval duration:        334.75ms
eval rate:            101.57 tokens/s

(base) PS C:\Windows\System32> ollama run llama3:70b
>>> /set verbose
Set 'verbose' mode.
>>> 你好 😊
Ni Hao! (您好) Welcome! How can I help you today? 🤔

total duration:       13.0373642s
load duration:        6.6727ms
prompt eval count:    12 token(s)
prompt eval duration: 2.312915s
prompt eval rate:     5.19 tokens/s
eval count:           22 token(s)
eval duration:        10.71453s
eval rate:            2.05 tokens/s

15731807423 avatar May 20 '24 07:05 15731807423

I think it might be related to https://github.com/ollama/ollama/issues/1651? It doesn't look like ollama is using the GPU on PopOS.

frederickjjoubert avatar May 20 '24 22:05 frederickjjoubert

It is using the GPU, but it's not particularly efficient at using it because the model is split across the CPU and GPU, and because of the machine's limitations (like slow memory). You can turn the GPU off entirely in the REPL with:

>>> /set parameter num_gpu 0

That should show you the difference in performance. You can also load a smaller number of layers onto the GPU (e.g. `/set parameter num_gpu 1`), which offloads most of the model's layers to the CPU. I believe the reason the activity monitor shows the GPU not doing much has to do with the bandwidth to the GPU and the contention between system memory and the GPU itself. That said, it's possible we can eke more speed out of this in the future if we're more clever about how we load the model onto the GPU.
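For reference, the same `num_gpu` option can also be set per request through Ollama's REST API instead of the REPL. A minimal Python sketch, assuming the server is running on the default localhost:11434 endpoint:

import requests

# Ask the local Ollama server to generate with the GPU disabled for
# this one request by setting num_gpu (the GPU layer count) to 0.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:70b",
        "prompt": "你好",
        "stream": False,
        "options": {"num_gpu": 0},  # 0 layers offloaded = CPU only
    },
)
data = resp.json()

# The response reports eval_count (tokens) and eval_duration
# (nanoseconds), so tokens/second is:
print(data["response"])
print(f"{data['eval_count'] / (data['eval_duration'] / 1e9):.2f} tokens/s")

The `num_gpu` key in `options` maps to the same runtime parameter that `/set parameter num_gpu` sets in the REPL.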

Back to CPU only (using num_gpu 0) I get:

total duration:       3m25.479681006s
load duration:        4.023984693s
prompt eval count:    208 token(s)
prompt eval duration: 41.733919s
prompt eval rate:     4.98 tokens/s
eval count:           259 token(s)
eval duration:        2m39.571141s
eval rate:            1.62 tokens/s

or roughly half the speed of the GPU.

pdevine avatar May 20 '24 22:05 pdevine

To expand on what Patrick mentioned, the 42% of the model loaded into system memory does its inference calculations on the CPU, which is significantly slower than the GPU. The GPU quickly finishes its calculations for each step of the inference and then sits idle waiting for the CPU to catch up. The closer you can get to 100% on GPU, the better the performance will be. If you have further questions, let us know.
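A back-of-the-envelope way to see why the monitor reads near 0% (a toy model, with an assumed per-layer GPU speedup rather than a measured one):

# Toy model: each generated token passes through all layers in sequence,
# so per-token time is the CPU layers' share plus the (much faster)
# GPU layers' share.
cpu_fraction = 0.42                     # layer share on CPU, from `ollama ps`
gpu_fraction = 1 - cpu_fraction
gpu_speedup = 20.0                      # ASSUMED per-layer GPU-vs-CPU speedup

cpu_time = cpu_fraction                 # relative time spent in CPU layers
gpu_time = gpu_fraction / gpu_speedup   # relative time spent in GPU layers

gpu_busy = gpu_time / (cpu_time + gpu_time)
print(f"GPU busy ~{gpu_busy:.0%} of each token's wall-clock time")
# ~6%: the GPU finishes its layers quickly, then idles while the CPU
# works through its share, so a sampling monitor shows it near 0%.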

dhiltgen avatar May 22 '24 21:05 dhiltgen