jetson-containers
Not sure if ollama:r36.2.0 is using GPU
Dear @dusty-nv, I pulled dustynv/ollama:r36.2.0 on a Jetson Orin 32G dev kit and ran it with the command: jetson-containers run --name ollama $(autotag ollama). The output is:
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env: export GIN_MODE=release
- using code: gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-04-27T00:32:16.148Z level=INFO source=routes.go:1064 msg="Listening on [::]:11434 (version 0.0.0)"
time=2024-04-27T00:32:16.149Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama359642117/runners
time=2024-04-27T00:32:26.579Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cuda_v12]"
time=2024-04-27T00:32:26.579Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-04-27T00:32:26.657Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama359642117/runners/cuda_v12/libcudart.so.12 count=1
time=2024-04-27T00:32:26.658Z level=INFO source=cpu_common.go:18 msg="CPU does not have vector extensions"
Inside the container I tried several models: llama3:latest and llava:34b.
I checked GPU usage with the nvidia-smi command, and the output is always:
Sat Apr 27 08:27:01 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.2.0                Driver Version: N/A          CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Orin (nvgpu)                  N/A  | N/A              N/A |                  N/A |
| N/A   N/A  N/A               N/A /  N/A | Not Supported        |     N/A          N/A |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
It seems the GPU is not being used by ollama?
Token output for llama3:latest is quite fast, but llava:34b is quite slow, and CPU usage with llava:34b is much higher than with llama3.
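For reference, the startup log above can be cross-checked with a small script. This is only a minimal sketch: the function name `summarize_gpu_detection` is made up for illustration, and the sample lines are copied from the log in this post. It only reports what ollama says it detected at startup; it does not show whether a particular model is actually running on the GPU.

```python
import re

# Sample lines copied from the startup log in this post.
LOG = '''
time=2024-04-27T00:32:26.579Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cuda_v12]"
time=2024-04-27T00:32:26.579Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
time=2024-04-27T00:32:26.657Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama359642117/runners/cuda_v12/libcudart.so.12 count=1
'''


def summarize_gpu_detection(log_text):
    """Hypothetical helper: pull the runner libraries and GPU count out of the log text."""
    # Runner libraries are listed inside square brackets, separated by spaces.
    libs = re.search(r'msg="Dynamic LLM libraries \[([^\]]*)\]"', log_text)
    # The GPU count is the first count=<n> field after the "detected GPUs" message.
    detected = re.search(r'msg="detected GPUs".*?\bcount=(\d+)', log_text)
    return {
        "libraries": libs.group(1).split() if libs else [],
        "gpu_count": int(detected.group(1)) if detected else 0,
    }


if __name__ == "__main__":
    print(summarize_gpu_detection(LOG))
```

For my log this reports the cuda_v12 library and one detected GPU, which is why I am unsure how to read the nvidia-smi output.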