CUDA out of memory error on Windows when ollama run starts up
Hi there,
I just installed ollama 0.1.27 and tried to run gemma:2b, but it reports a CUDA out of memory error. Could you please investigate and figure out the root cause?
I'm using an i7-4700HQ CPU with 16 GB of RAM.
Attached are the log and nvidia-smi report.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.41                 Driver Version: 531.41       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 960M       WDDM | 00000000:02:00.0 Off |                  N/A |
| N/A    0C    P0              N/A /  N/A |    181MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       272    C+G   ...s (x86)\Mozilla Firefox\firefox.exe    N/A      |
|    0   N/A  N/A      4520    C+G   ....0_x64__8wekyb3d8bbwe\YourPhone.exe    N/A      |
|    0   N/A  N/A      7580    C+G   ....Experiences.TextInput.InputApp.exe    N/A      |
|    0   N/A  N/A      9940    C+G   ...2txyewy\StartMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     11012    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe    N/A      |
|    0   N/A  N/A     12428    C+G   ...cal\Microsoft\OneDrive\OneDrive.exe    N/A      |
|    0   N/A  N/A     13100    C+G   ...s (x86)\Mozilla Firefox\firefox.exe    N/A      |
|    0   N/A  N/A     13332    C+G   ...guoyun\bin-7.1.3\NutstoreClient.exe    N/A      |
+---------------------------------------------------------------------------------------+
log:
[GIN] 2024/02/29 - 23:47:32 | 200 | 32.7µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/02/29 - 23:47:32 | 200 | 1.2447ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/02/29 - 23:47:32 | 200 | 2.4218ms | 127.0.0.1 | POST "/api/show"
time=2024-02-29T23:47:37.171+08:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-02-29T23:47:37.171+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-02-29T23:47:37.216+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll C:\WINDOWS\system32\nvml.dll]"
time=2024-02-29T23:47:37.236+08:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-02-29T23:47:37.236+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-29T23:47:37.248+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.0"
time=2024-02-29T23:47:37.248+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-29T23:47:37.252+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.0"
time=2024-02-29T23:47:37.253+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-02-29T23:47:37.253+08:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to
time=2024-02-29T23:47:37.328+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\bolun\AppData\Local\Temp\ollama625311207\cuda_v11.3\ext_server.dll"
time=2024-02-29T23:47:37.329+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 960M, compute capability 5.0, VMM: yes
llama_model_loader: loaded meta data with 21 key-value pairs and 164 tensors from C:\Users\bolun.ollama\models\blobs\sha256-c1864a5eb19305c40519da12cc543519e48a0697ecd30e15d5ac228644957d12 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = gemma
llama_model_loader: - kv   1: general.name str = gemma-2b-it
llama_model_loader: - kv   2: gemma.context_length u32 = 8192
llama_model_loader: - kv   3: gemma.block_count u32 = 18
llama_model_loader: - kv   4: gemma.embedding_length u32 = 2048
llama_model_loader: - kv   5: gemma.feed_forward_length u32 = 16384
llama_model_loader: - kv   6: gemma.attention.head_count u32 = 8
llama_model_loader: - kv   7: gemma.attention.head_count_kv u32 = 1
llama_model_loader: - kv   8: gemma.attention.key_length u32 = 256
llama_model_loader: - kv   9: gemma.attention.value_length u32 = 256
llama_model_loader: - kv  10: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv  11: tokenizer.ggml.model str = llama
llama_model_loader: - kv  12: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv  13: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv  14: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv  15: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv  16: tokenizer.ggml.tokens arr[str,256128] = ["", " ", " ", " ", ...
llama_model_loader: - kv  17: tokenizer.ggml.scores arr[f32,256128] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18: tokenizer.ggml.token_type arr[i32,256128] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19: general.quantization_version u32 = 2
llama_model_loader: - kv  20: general.file_type u32 = 2
llama_model_loader: - type  f32:   37 tensors
llama_model_loader: - type q4_0:  126 tensors
llama_model_loader: - type q8_0:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 544/256128 vs 388/256128 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256128
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 1
llm_load_print_meta: n_layer          = 18
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 2.51 B
llm_load_print_meta: model size       = 1.56 GiB (5.34 BPW)
llm_load_print_meta: general.name     = gemma-2b-it
llm_load_print_meta: BOS token        = 2 ' '
llm_load_print_meta: EOS token        = 1 ' '
llm_load_print_meta: UNK token        = 3 ' '
llm_load_print_meta: PAD token        = 0 ' '
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.13 MiB
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
llm_load_tensors:   CPU buffer size =  531.52 MiB
llm_load_tensors: CUDA0 buffer size = 1594.93 MiB
.....................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 36.00 MiB
llama_new_context_with_model: KV self size = 36.00 MiB, K (f16): 18.00 MiB, V (f16): 18.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size   =   9.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size     = 504.25 MiB
llama_new_context_with_model: CUDA_Host compute buffer size =   4.00 MiB
llama_new_context_with_model: graph splits (measure): 3
CUDA error: out of memory
  current device: 0, in function ggml_cuda_pool_malloc_vmm at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:7990
  cuMemSetAccess(g_cuda_pool_addr[device] + g_cuda_pool_size[device], reserve_size, &access, 1)
GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"
cc @dhiltgen
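For what it's worth, a quick tally of the CUDA0 buffers reported in the log above (a rough sketch only; the figures are copied from the log, and driver/runtime overhead plus the ~181 MiB already in use per nvidia-smi are not included). The reported buffers nominally fit within the 4 GiB card, and the assert fires inside ggml_cuda_pool_malloc_vmm at the cuMemSetAccess call, i.e. while growing the VMM memory pool rather than while uploading the weights:

# Rough tally of the CUDA0 buffers reported in the log above (values copied verbatim).
buffers_mib = {
    "CUDA0 model buffer":   1594.93,  # llm_load_tensors: CUDA0 buffer size
    "CUDA0 KV cache":         36.00,  # llama_kv_cache_init: CUDA0 KV buffer size
    "CUDA0 compute buffer":  504.25,  # llama_new_context_with_model: CUDA0 compute buffer size
}
total = sum(buffers_mib.values())
print(f"reported CUDA0 usage: {total:.2f} MiB of 4096 MiB on the GTX 960M")
# -> reported CUDA0 usage: 2135.18 MiB of 4096 MiB on the GTX 960M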
I am also experiencing the same error. Here is the error log:
time=2024-03-02T10:57:13.946+08:00 level=INFO source=images.go:710 msg="total blobs: 17"
time=2024-03-02T10:57:13.959+08:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-02T10:57:13.961+08:00 level=INFO source=routes.go:1019 msg="Listening on [::]:1123 (version 0.1.27)"
time=2024-03-02T10:57:13.961+08:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-02T10:57:14.141+08:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx2 cpu cuda_v11.3 cpu_avx]"
[GIN] 2024/03/02 - 10:57:14 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/03/02 - 10:57:14 | 200 |      2.3633ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/03/02 - 10:57:14 | 200 |      2.3207ms |       127.0.0.1 | POST     "/api/show"
time=2024-03-02T10:57:14.905+08:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-02T10:57:14.905+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-02T10:57:14.909+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]"
time=2024-03-02T10:57:14.926+08:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-02T10:57:14.926+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T10:57:14.941+08:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\huan\AppData\Local\Temp\ollama2241795987\cuda_v11.3;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Users\huan\AppData\Local\Microsoft\WindowsApps;;C:\Users\huan\AppData\Local\Programs\Ollama"
time=2024-03-02T10:57:15.046+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\huan\AppData\Local\Temp\ollama2241795987\cuda_v11.3\ext_server.dll"
time=2024-03-02T10:57:15.047+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from C:\Users\huan.ollama\models\blobs\sha256-8934d96d3f08982e95922b2b7a2c626a1fe873d7c3b06e8e56d7bc0a1fef9246 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["", "", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,61249]   = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 ''
llm_load_print_meta: EOS token        = 2 ''
llm_load_print_meta: UNK token        = 0 '
My specs: Ubuntu 22.04, 16.0 GiB RAM, GeForce 940MX (2048 MiB).
Using gemma:2b, I hit the same CUDA OOM error when calling the Ollama API from my web app. I do NOT hit the same error via ollama run.
So I got Ollama working with Gemma from my web app by:
- putting NO user instruction in the system message of the API payload (or maybe removing the system message entirely; I have not tried)
- starting the conversation with "Hello"; any longer question like "Why is the sky blue?" would not work
- after that, I can ask any long question
Strange, but it worked for me; a minimal request matching this workaround is sketched below. Note that I do not face a similar issue with other LLMs like Mistral.
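This is roughly the request flow that works for me (a minimal sketch, not an official pattern; it assumes the default local server at http://localhost:11434 and the gemma:2b model, so adjust to your setup):

# Minimal sketch of the workaround described above: no system message,
# greet first, then ask the longer question.
import requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint

def chat(messages):
    # stream=False returns a single JSON object instead of a stream of chunks
    resp = requests.post(OLLAMA_CHAT_URL, json={
        "model": "gemma:2b",
        "messages": messages,
        "stream": False,
    })
    resp.raise_for_status()
    return resp.json()["message"]["content"]

# No system message; open with a short greeting first...
history = [{"role": "user", "content": "Hello"}]
history.append({"role": "assistant", "content": chat(history)})

# ...then longer questions go through without hitting the OOM (in my case).
history.append({"role": "user", "content": "Why is the sky blue?"})
print(chat(history))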
BTW, this is my first ever GitHub comment. Many, many thanks to the great Ollama team!
Similar issue here, with a Ryzen 5700X, 32 GB RAM, and dual GPUs:
time=2024-03-02T23:09:06.654-05:00 level=INFO source=images.go:710 msg="total blobs: 0"
time=2024-03-02T23:09:06.655-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0"
time=2024-03-02T23:09:06.655-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)"
time=2024-03-02T23:09:06.655-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-02T23:09:06.811-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cuda_v11.3 cpu_avx cpu_avx2 cpu]"
[GIN] 2024/03/02 - 23:09:23 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/03/02 - 23:09:23 | 404 |       528.3µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/03/02 - 23:09:24 | 200 |    492.9968ms |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/03/02 - 23:09:27 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/03/02 - 23:09:27 | 404 |            0s |       127.0.0.1 | POST     "/api/show"
time=2024-03-02T23:09:28.622-05:00 level=INFO source=download.go:136 msg="downloading e8a35b5937a5 in 42 100 MB part(s)"
time=2024-03-02T23:10:36.482-05:00 level=INFO source=download.go:136 msg="downloading 43070e2d4e53 in 1 11 KB part(s)"
time=2024-03-02T23:10:38.340-05:00 level=INFO source=download.go:136 msg="downloading e6836092461f in 1 42 B part(s)"
time=2024-03-02T23:10:41.345-05:00 level=INFO source=download.go:136 msg="downloading ed11eda7790d in 1 30 B part(s)"
time=2024-03-02T23:10:43.244-05:00 level=INFO source=download.go:136 msg="downloading f9b1e3196ecf in 1 483 B part(s)"
[GIN] 2024/03/02 - 23:10:47 | 200 |         1m20s |       127.0.0.1 | POST     "/api/pull"
[GIN] 2024/03/02 - 23:10:47 | 200 |       524.1µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/03/02 - 23:10:47 | 200 |       528.6µs |       127.0.0.1 | POST     "/api/show"
time=2024-03-02T23:10:47.816-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type"
time=2024-03-02T23:10:47.816-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll"
time=2024-03-02T23:10:47.840-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]"
time=2024-03-02T23:10:47.855-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected"
time=2024-03-02T23:10:47.859-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-02T23:10:47.887-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\RUCARA~1\AppData\Local\Temp\ollama991450673\cuda_v11.3;C:\Users\rucaradio\AppData\Local\Programs\Ollama;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Users\rucaradio\.cargo\bin;C:\Users\rucaradio\scoop\shims;C:\Users\rucaradio\AppData\Local\Microsoft\WindowsApps;C:\Users\rucaradio\AppData\Local\Programs\Microsoft VS Code\bin;C:\ "
time=2024-03-02T23:10:48.341-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\RUCARA~1\AppData\Local\Temp\ollama991450673\cuda_v11.3\ext_server.dll"
time=2024-03-02T23:10:48.342-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Device 1: Quadro M6000, compute capability 5.2, VMM: yes
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 2
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["", "", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,58980]   = ["▁ t", "i n", "e r", "▁ a", "h e...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name     = mistralai
llm_load_print_meta: BOS token        = 1 ''
llm_load_print_meta: EOS token        = 2 ''
llm_load_print_meta: UNK token        = 0 '
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["", "", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,58980]   = ["▁ t", "i n", "e r", "▁ a", "h e...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name     = mistralai
llm_load_print_meta: BOS token        = 1 ''
llm_load_print_meta: EOS token        = 2 ''
llm_load_print_meta: UNK token        = 0 '
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["", "", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,58980]   = ["▁ t", "i n", "e r", "▁ a", "h e...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name     = mistralai
llm_load_print_meta: BOS token        = 1 ''
llm_load_print_meta: EOS token        = 2 ''
llm_load_print_meta: UNK token        = 0 '
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61                 Driver Version: 551.61         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro M6000                  WDDM |   00000000:05:00.0  On |                  Off |
| 27%   51C    P8             28W /  250W |     646MiB /  12288MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060       WDDM |   00000000:0B:00.0  On |                  N/A |
|  0%   35C    P8              8W /  170W |     118MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4436    C+G   ...on\122.0.2365.59\msedgewebview2.exe      N/A      |
|    0   N/A  N/A      6612    C+G   ...on\122.0.2365.59\msedgewebview2.exe      N/A      |
|    0   N/A  N/A      6792    C+G   C:\Windows\explorer.exe                     N/A      |
|    0   N/A  N/A      8600    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe      N/A      |
|    0   N/A  N/A      9404    C+G   ...2txyewy\StartMenuExperienceHost.exe      N/A      |
(base) C:\newpdev\ollama>NVCC -V
NVCC: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:30:42_Pacific_Standard_Time_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
time=2024-04-01T08:12:19.872+08:00 level=INFO source=gpu.go:115 msg="Detecting GPU type"
time=2024-04-01T08:12:19.872+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library cudart64_*.dll"
time=2024-04-01T08:12:19.881+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [C:\Users\Administrator\AppData\Local\Programs\Ollama\cudart64_110.dll c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudart64_110.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudart64_110.dll]"
time=2024-04-01T08:12:19.940+08:00 level=INFO source=gpu.go:120 msg="Nvidia GPU detected via cudart"
time=2024-04-01T08:12:19.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-01T08:12:20.082+08:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
time=2024-04-01T08:12:20.082+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-01T08:12:20.083+08:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6"
time=2024-04-01T08:12:20.083+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-01T08:12:20.083+08:00 level=INFO source=assets.go:108 msg="Updating PATH to C:\Users\ADMINI~1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\libnvvp;C:\Program Files (x86)\jdk/bin;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\bin;D:\WindowsVSC\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64\;C:\Program Files\PlasticSCM5\server;C:\Program Files\PlasticSCM5\client;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;D:\work\apache-tomcat-9.0.1-windows-x64\apache-tomcat-9.0.1\bin\;D:\work\apache-maven-3.8.8-bin\apache-maven-3.8.8\bin\;D:\work\gradle-8.2.1-all\gradle-8.2.1\bin;D:\work\apache-jmeter-5.5\bin;D:\work\w64devkit-1.19.0\w64devkit\bin;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files\MySQL\MySQL Server 8.0\bin;D:\Git\cmd;D:\python\;D:\nvm;C:\Program Files\nodejs;D:\work\visualvm_216\bin;D:\HashiCorp\Vagrant\bin;D:\weixin\微信web开发者工具\dll;D:\work\netcat-win32-1.12;D:\work\VMware-ovftool-4.5.0-20459872-win.x86_64\ovftool;D:\work\lu;D:\work\kotlin-compiler-1.9.22\kotlinc\bin;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA Corporation\Nsight Compute 2020.3.0\;D:\miniconda3;D:\miniconda3\Library\mingw-w64\bin;D:\miniconda3\Library\usr\bin;D:\miniconda3\Library\bin;D:\miniconda3\Scripts;C:\Program Files\MySQL\MySQL Shell 8.0\bin\;C:\Users\Administrator\AppData\Local\Microsoft\WindowsApps;C:\Users\Administrator\AppData\Roaming\npm;D:\nvm;C:\Program Files\nodejs;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\bin\;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\jre\bin\;C:\Users\Administrator\AppData\Local\GitHubDesktop\bin;C:\Users\Administrator\.dotnet\tools;D:\work\mongosh\;;C:\Users\Administrator\AppData\Local\Programs\Ollama"
loading library C:\Users\ADMINI~1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3\ext_server.dll
time=2024-04-01T08:12:20.099+08:00 level=INFO source=dyn_ext_server.go:87 msg="Loading Dynamic llm server: C:\Users\ADMINI~1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3\ext_server.dll"
time=2024-04-01T08:12:20.100+08:00 level=INFO source=dyn_ext_server.go:147 msg="Initializing llama server"
llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from D:\ollama\blobs\sha256-456402914e838a953e0cf80caa6adbe75383d9e63584a964f504a7bbb8f7aad9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = gemma-7b-it
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   4:                          gemma.block_count u32              = 28
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv  10:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,256000]  = ["
I would suggest giving the latest release a try to see if that improves the situation. That said, these may ultimately be due to #4599, which I'm still working on.
Please upgrade to the latest version (0.1.45); this should now be resolved for CUDA cards.
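If you want to double-check which version the running server is on after upgrading, a quick way (a small sketch, assuming the default local endpoint) is to query the /api/version endpoint:

# Query the running Ollama server for its version (default local endpoint assumed).
import requests

version = requests.get("http://localhost:11434/api/version").json()["version"]
print(version)  # should print 0.1.45 or newer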