
CUDA out of memory error on Windows when ollama run starts up

Open boluny opened this issue 1 year ago • 5 comments

Hi there,

I just installed ollama 0.1.27 and tried to run gemma:2b, but it reports a CUDA out of memory error. Could you please investigate and figure out the root cause?

I'm using an i7-4700HQ CPU with 16 GB of RAM.

Attached are the log and the nvidia-smi report:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.41                 Driver Version: 531.41       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf          Pwr:Usage/Cap  |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 960M       WDDM | 00000000:02:00.0 Off |                  N/A |
| N/A    0C    P0              N/A /  N/A |    181MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       272    C+G   ...s (x86)\Mozilla Firefox\firefox.exe        N/A  |
|    0   N/A  N/A      4520    C+G   ....0_x64__8wekyb3d8bbwe\YourPhone.exe        N/A  |
|    0   N/A  N/A      7580    C+G   ....Experiences.TextInput.InputApp.exe        N/A  |
|    0   N/A  N/A      9940    C+G   ...2txyewy\StartMenuExperienceHost.exe        N/A  |
|    0   N/A  N/A     11012    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe        N/A  |
|    0   N/A  N/A     12428    C+G   ...cal\Microsoft\OneDrive\OneDrive.exe        N/A  |
|    0   N/A  N/A     13100    C+G   ...s (x86)\Mozilla Firefox\firefox.exe        N/A  |
|    0   N/A  N/A     13332    C+G   ...guoyun\bin-7.1.3\NutstoreClient.exe        N/A  |
+---------------------------------------------------------------------------------------+

log:

[GIN] 2024/02/29 - 23:47:32 | 200 | 32.7µs | 127.0.0.1 | HEAD "/" [GIN] 2024/02/29 - 23:47:32 | 200 | 1.2447ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/02/29 - 23:47:32 | 200 | 2.4218ms | 127.0.0.1 | POST "/api/show" time=2024-02-29T23:47:37.171+08:00 level=INFO source=gpu.go:94 msg="Detecting GPU type" time=2024-02-29T23:47:37.171+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll" time=2024-02-29T23:47:37.216+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll C:\WINDOWS\system32\nvml.dll]" time=2024-02-29T23:47:37.236+08:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected" time=2024-02-29T23:47:37.236+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-02-29T23:47:37.248+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.0" time=2024-02-29T23:47:37.248+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-02-29T23:47:37.252+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.0" time=2024-02-29T23:47:37.253+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-02-29T23:47:37.253+08:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to time=2024-02-29T23:47:37.328+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\bolun\AppData\Local\Temp\ollama625311207\cuda_v11.3\ext_server.dll" time=2024-02-29T23:47:37.329+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce GTX 960M, compute capability 5.0, VMM: yes llama_model_loader: loaded meta data with 21 key-value pairs and 164 tensors from C:\Users\bolun.ollama\models\blobs\sha256-c1864a5eb19305c40519da12cc543519e48a0697ecd30e15d5ac228644957d12 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = gemma llama_model_loader: - kv 1: general.name str = gemma-2b-it llama_model_loader: - kv 2: gemma.context_length u32 = 8192 llama_model_loader: - kv 3: gemma.block_count u32 = 18 llama_model_loader: - kv 4: gemma.embedding_length u32 = 2048 llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 16384 llama_model_loader: - kv 6: gemma.attention.head_count u32 = 8 llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 1 llama_model_loader: - kv 8: gemma.attention.key_length u32 = 256 llama_model_loader: - kv 9: gemma.attention.value_length u32 = 256 llama_model_loader: - kv 10: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 11: tokenizer.ggml.model str = llama llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 14: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 15: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,256128] = ["", "", "", "", ... llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,256128] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,256128] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ... 
llama_model_loader: - kv 19: general.quantization_version u32 = 2 llama_model_loader: - kv 20: general.file_type u32 = 2 llama_model_loader: - type f32: 37 tensors llama_model_loader: - type q4_0: 126 tensors llama_model_loader: - type q8_0: 1 tensors llm_load_vocab: mismatch in special tokens definition ( 544/256128 vs 388/256128 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 256128 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 2048 llm_load_print_meta: n_head = 8 llm_load_print_meta: n_head_kv = 1 llm_load_print_meta: n_layer = 18 llm_load_print_meta: n_rot = 256 llm_load_print_meta: n_embd_head_k = 256 llm_load_print_meta: n_embd_head_v = 256 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 256 llm_load_print_meta: n_embd_v_gqa = 256 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 16384 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 2B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 2.51 B llm_load_print_meta: model size = 1.56 GiB (5.34 BPW) llm_load_print_meta: general.name = gemma-2b-it llm_load_print_meta: BOS token = 2 '' llm_load_print_meta: EOS token = 1 '' llm_load_print_meta: UNK token = 3 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 227 '<0x0A>' llm_load_tensors: ggml ctx size = 0.13 MiB llm_load_tensors: offloading 18 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 19/19 layers to GPU llm_load_tensors: CPU buffer size = 531.52 MiB llm_load_tensors: CUDA0 buffer size = 1594.93 MiB ..................................................... llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 36.00 MiB llama_new_context_with_model: KV self size = 36.00 MiB, K (f16): 18.00 MiB, V (f16): 18.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 9.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 504.25 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 4.00 MiB llama_new_context_with_model: graph splits (measure): 3 CUDA error: out of memory current device: 0, in function ggml_cuda_pool_malloc_vmm at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:7990 cuMemSetAccess(g_cuda_pool_addr[device] + g_cuda_pool_size[device], reserve_size, &access, 1) GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"

boluny avatar Feb 29 '24 16:02 boluny

cc @dhiltgen

pdevine avatar Mar 01 '24 01:03 pdevine

I am also experiencing the same error. Here is the error log: time=2024-03-02T10:57:13.946+08:00 level=INFO source=images.go:710 msg="total blobs: 17" time=2024-03-02T10:57:13.959+08:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0" time=2024-03-02T10:57:13.961+08:00 level=INFO source=routes.go:1019 msg="Listening on [::]:1123 (version 0.1.27)" time=2024-03-02T10:57:13.961+08:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..." time=2024-03-02T10:57:14.141+08:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx2 cpu cuda_v11.3 cpu_avx]" [GIN] 2024/03/02 - 10:57:14 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/03/02 - 10:57:14 | 200 | 2.3633ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/03/02 - 10:57:14 | 200 | 2.3207ms | 127.0.0.1 | POST "/api/show" time=2024-03-02T10:57:14.905+08:00 level=INFO source=gpu.go:94 msg="Detecting GPU type" time=2024-03-02T10:57:14.905+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll" time=2024-03-02T10:57:14.909+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]" time=2024-03-02T10:57:14.926+08:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected" time=2024-03-02T10:57:14.926+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:14.941+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6" time=2024-03-02T10:57:14.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:14.941+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6" time=2024-03-02T10:57:14.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:14.941+08:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\huan\AppData\Local\Temp\ollama2241795987\cuda_v11.3;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Users\huan\AppData\Local\Microsoft\WindowsApps;;C:\Users\huan\AppData\Local\Programs\Ollama" time=2024-03-02T10:57:15.046+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\huan\AppData\Local\Temp\ollama2241795987\cuda_v11.3\ext_server.dll" time=2024-03-02T10:57:15.047+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from C:\Users\huan.ollama\models\blobs\sha256-8934d96d3f08982e95922b2b7a2c626a1fe873d7c3b06e8e56d7bc0a1fef9246 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = LLaMA v2 llama_model_loader: - kv 2: llama.context_length u32 = 4096 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: tokenizer.ggml.model str = llama llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... llama_model_loader: - kv 22: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 4096 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 32 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 4096 llm_load_print_meta: n_embd_v_gqa = 4096 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 11008 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 4096 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 6.74 B llm_load_print_meta: model size = 3.56 GiB (4.54 BPW) llm_load_print_meta: general.name = LLaMA v2 llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.22 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 70.31 MiB llm_load_tensors: CUDA0 buffer size = 3577.56 MiB .................................................................................................. 
llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB llama_new_context_with_model: graph splits (measure): 3 time=2024-03-02T10:57:27.527+08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop" [GIN] 2024/03/02 - 10:57:27 | 200 | 13.1901436s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/03/02 - 10:57:42 | 200 | 453.2522ms | 127.0.0.1 | POST "/api/chat" time=2024-03-02T10:57:56.201+08:00 level=INFO source=routes.go:78 msg="changing loaded model" time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:58.660+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6" time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:58.660+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6" time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:58.660+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\huan\AppData\Local\Temp\ollama2241795987\cuda_v11.3\ext_server.dll" time=2024-03-02T10:57:58.660+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from C:\Users\huan.ollama\models\blobs\sha256-456402914e838a953e0cf80caa6adbe75383d9e63584a964f504a7bbb8f7aad9 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = gemma llama_model_loader: - kv 1: general.name str = gemma-7b-it llama_model_loader: - kv 2: gemma.context_length u32 = 8192 llama_model_loader: - kv 3: gemma.embedding_length u32 = 3072 llama_model_loader: - kv 4: gemma.block_count u32 = 28 llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 24576 llama_model_loader: - kv 6: gemma.attention.head_count u32 = 16 llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 16 llama_model_loader: - kv 8: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 9: gemma.attention.key_length u32 = 256 llama_model_loader: - kv 10: gemma.attention.value_length u32 = 256 llama_model_loader: - kv 11: tokenizer.ggml.model str = llama llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ... llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ... 
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... llama_model_loader: - kv 22: general.quantization_version u32 = 2 llama_model_loader: - kv 23: general.file_type u32 = 2 llama_model_loader: - type f32: 57 tensors llama_model_loader: - type q4_0: 196 tensors llama_model_loader: - type q8_0: 1 tensors llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 256000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 3072 llm_load_print_meta: n_head = 16 llm_load_print_meta: n_head_kv = 16 llm_load_print_meta: n_layer = 28 llm_load_print_meta: n_rot = 192 llm_load_print_meta: n_embd_head_k = 256 llm_load_print_meta: n_embd_head_v = 256 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 4096 llm_load_print_meta: n_embd_v_gqa = 4096 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 24576 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.54 B llm_load_print_meta: model size = 4.84 GiB (4.87 BPW) llm_load_print_meta: general.name = gemma-7b-it llm_load_print_meta: BOS token = 2 '' llm_load_print_meta: EOS token = 1 '' llm_load_print_meta: UNK token = 3 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 227 '<0x0A>' llm_load_tensors: ggml ctx size = 0.19 MiB llm_load_tensors: offloading 28 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 29/29 layers to GPU llm_load_tensors: CPU buffer size = 796.88 MiB llm_load_tensors: CUDA0 buffer size = 4955.54 MiB ........................................................................... 
llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 11.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 506.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 6.00 MiB llama_new_context_with_model: graph splits (measure): 3 time=2024-03-02T10:58:13.011+08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop" CUDA error: out of memory current device: 0, in function ggml_cuda_pool_malloc_vmm at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:7990 cuMemSetAccess(g_cuda_pool_addr[device] + g_cuda_pool_size[device], reserve_size, &access, 1) GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"

hbqclh avatar Mar 02 '24 03:03 hbqclh

My spec: Ubuntu 22.04, 16.0 GiB RAM, GeForce 940MX (2048 MiB).

Using gemma:2b, I face the same CUDA OOM error when calling the Ollama API from my web app. I do NOT face the same error via ollama run.

So I got Ollama working with Gemma from my web app by:

  • putting NO user instruction in the system message of the API payload (or maybe removing the system message entirely; I have not tried that),
  • starting the conversation with "Hello"; any longer question like "Why is the sky blue?" would not work,
  • after that, I can ask any long question.

Strange, but it worked for me (a sketch of the request shape follows below). Note that I do not face a similar issue with other LLMs like Mistral.
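This is only a sketch, not my actual web app code; it assumes the default local server address http://127.0.0.1:11434 and the standard /api/chat endpoint, so adjust those if your setup differs.

import requests

# Sketch only: default local Ollama address; change if your server listens elsewhere.
OLLAMA_CHAT_URL = "http://127.0.0.1:11434/api/chat"

def chat(messages):
    # stream=False asks the server for a single JSON response instead of a token stream.
    resp = requests.post(OLLAMA_CHAT_URL, json={
        "model": "gemma:2b",
        "messages": messages,   # note: no "system" message at all
        "stream": False,
    })
    resp.raise_for_status()
    return resp.json()["message"]["content"]

# Start with a short greeting as the very first turn.
history = [{"role": "user", "content": "Hello"}]
reply = chat(history)
print(reply)

# After that short first exchange, longer questions go through for me.
history += [
    {"role": "assistant", "content": reply},
    {"role": "user", "content": "Why is the sky blue?"},
]
print(chat(history))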

BTW, this is my first ever GitHub comment. Many, many thanks to the great Ollama team!

trandhbao avatar Mar 02 '24 12:03 trandhbao

Similar issue here, with a Ryzen 5700X, 32 GB RAM, and dual GPUs:

time=2024-03-02T23:09:06.654-05:00 level=INFO source=images.go:710 msg="total blobs: 0" time=2024-03-02T23:09:06.655-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0" time=2024-03-02T23:09:06.655-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)" time=2024-03-02T23:09:06.655-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..." time=2024-03-02T23:09:06.811-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cuda_v11.3 cpu_avx cpu_avx2 cpu]" [GIN] 2024/03/02 - 23:09:23 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/03/02 - 23:09:23 | 404 | 528.3µs | 127.0.0.1 | POST "/api/show" [GIN] 2024/03/02 - 23:09:24 | 200 | 492.9968ms | 127.0.0.1 | POST "/api/pull" [GIN] 2024/03/02 - 23:09:27 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/03/02 - 23:09:27 | 404 | 0s | 127.0.0.1 | POST "/api/show" time=2024-03-02T23:09:28.622-05:00 level=INFO source=download.go:136 msg="downloading e8a35b5937a5 in 42 100 MB part(s)" time=2024-03-02T23:10:36.482-05:00 level=INFO source=download.go:136 msg="downloading 43070e2d4e53 in 1 11 KB part(s)" time=2024-03-02T23:10:38.340-05:00 level=INFO source=download.go:136 msg="downloading e6836092461f in 1 42 B part(s)" time=2024-03-02T23:10:41.345-05:00 level=INFO source=download.go:136 msg="downloading ed11eda7790d in 1 30 B part(s)" time=2024-03-02T23:10:43.244-05:00 level=INFO source=download.go:136 msg="downloading f9b1e3196ecf in 1 483 B part(s)" [GIN] 2024/03/02 - 23:10:47 | 200 | 1m20s | 127.0.0.1 | POST "/api/pull" [GIN] 2024/03/02 - 23:10:47 | 200 | 524.1µs | 127.0.0.1 | POST "/api/show" [GIN] 2024/03/02 - 23:10:47 | 200 | 528.6µs | 127.0.0.1 | POST "/api/show" time=2024-03-02T23:10:47.816-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type" time=2024-03-02T23:10:47.816-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll" time=2024-03-02T23:10:47.840-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]" time=2024-03-02T23:10:47.855-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected" time=2024-03-02T23:10:47.859-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T23:10:47.887-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-02T23:10:47.887-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T23:10:47.887-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-02T23:10:47.887-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T23:10:47.887-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\RUCARA~1\AppData\Local\Temp\ollama991450673\cuda_v11.3;C:\Users\rucaradio\AppData\Local\Programs\Ollama;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program 
Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Users\rucaradio\.cargo\bin;C:\Users\rucaradio\scoop\shims;C:\Users\rucaradio\AppData\Local\Microsoft\WindowsApps;C:\Users\rucaradio\AppData\Local\Programs\Microsoft VS Code\bin;C:\ " time=2024-03-02T23:10:48.341-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\RUCARA~1\AppData\Local\Temp\ollama991450673\cuda_v11.3\ext_server.dll" time=2024-03-02T23:10:48.342-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes Device 1: Quadro M6000, compute capability 5.2, VMM: yes llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e... llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 3.83 GiB (4.54 BPW) llm_load_print_meta: general.name = mistralai llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.33 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 70.31 MiB llm_load_tensors: CUDA0 buffer size = 1989.53 MiB llm_load_tensors: CUDA1 buffer size = 1858.02 MiB .................................................................................................. llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB llama_new_context_with_model: graph splits (measure): 5 CUDA error: unspecified launch failure current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953 cudaDeviceSynchronize() GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error" time=2024-03-03T01:54:03.879-05:00 level=INFO source=images.go:710 msg="total blobs: 5" time=2024-03-03T01:54:03.884-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0" time=2024-03-03T01:54:03.885-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)" time=2024-03-03T01:54:03.885-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..." 
time=2024-03-03T01:54:04.032-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11.3 cpu]" [GIN] 2024/03/03 - 01:54:04 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/03/03 - 01:54:04 | 200 | 14.8896ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/03/03 - 01:54:04 | 200 | 505.4µs | 127.0.0.1 | POST "/api/show" time=2024-03-03T01:54:04.874-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type" time=2024-03-03T01:54:04.874-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll" time=2024-03-03T01:54:04.889-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]" time=2024-03-03T01:54:04.895-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected" time=2024-03-03T01:54:04.907-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:54:04.942-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-03T01:54:04.942-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:54:04.942-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-03T01:54:04.942-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:54:04.942-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\RUCARA~1\AppData\Local\Temp\ollama2071667329\cuda_v11.3;C:\Users\rucaradio\AppData\Local\Programs\Ollama;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Users\rucaradio\.cargo\bin;C:\Users\rucaradio\scoop\shims;C:\Users\rucaradio\AppData\Local\Microsoft\WindowsApps;C:\Users\rucaradio\AppData\Local\Programs\Microsoft VS Code\bin;C:\ " time=2024-03-03T01:54:05.435-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\RUCARA~1\AppData\Local\Temp\ollama2071667329\cuda_v11.3\ext_server.dll" time=2024-03-03T01:54:05.435-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes Device 1: Quadro M6000, compute capability 5.2, VMM: yes llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e... llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 3.83 GiB (4.54 BPW) llm_load_print_meta: general.name = mistralai llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.33 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 70.31 MiB llm_load_tensors: CUDA0 buffer size = 1989.53 MiB llm_load_tensors: CUDA1 buffer size = 1858.02 MiB .................................................................................................. llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB llama_new_context_with_model: graph splits (measure): 5 CUDA error: unspecified launch failure current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953 cudaDeviceSynchronize() GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error" time=2024-03-03T01:55:31.714-05:00 level=INFO source=images.go:710 msg="total blobs: 5" time=2024-03-03T01:55:31.720-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0" time=2024-03-03T01:55:31.722-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)" time=2024-03-03T01:55:31.722-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..." 
time=2024-03-03T01:55:31.881-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu cpu_avx2 cuda_v11.3 cpu_avx]" [GIN] 2024/03/03 - 01:59:06 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/03/03 - 01:59:06 | 200 | 18.2741ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/03/03 - 01:59:06 | 200 | 549.2µs | 127.0.0.1 | POST "/api/show" time=2024-03-03T01:59:07.250-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type" time=2024-03-03T01:59:07.250-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll" time=2024-03-03T01:59:07.275-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]" time=2024-03-03T01:59:07.300-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected" time=2024-03-03T01:59:07.302-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:59:07.332-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-03T01:59:07.332-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:59:07.332-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-03T01:59:07.332-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:59:07.332-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\RUCARA~1\AppData\Local\Temp\ollama4153122201\cuda_v11.3;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Users\rucaradio\.cargo\bin;C:\Users\rucaradio\scoop\shims;C:\Users\rucaradio\AppData\Local\Microsoft\WindowsApps;C:\Users\rucaradio\AppData\Local\Programs\Microsoft VS Code\bin;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\;C:\Users\rucaradio\AppData\Local\Programs\Ollama" time=2024-03-03T01:59:07.843-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\RUCARA~1\AppData\Local\Temp\ollama4153122201\cuda_v11.3\ext_server.dll" time=2024-03-03T01:59:07.843-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" 
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes Device 1: Quadro M6000, compute capability 5.2, VMM: yes llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e... llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 3.83 GiB (4.54 BPW) llm_load_print_meta: general.name = mistralai llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.33 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 70.31 MiB llm_load_tensors: CUDA0 buffer size = 1989.53 MiB llm_load_tensors: CUDA1 buffer size = 1858.02 MiB .................................................................................................. llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB llama_new_context_with_model: graph splits (measure): 5 CUDA error: unspecified launch failure current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953 cudaDeviceSynchronize() GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61                 Driver Version: 551.61         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf          Pwr:Usage/Cap  |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro M6000                  WDDM |   00000000:05:00.0  On |                  Off |
| 27%   51C    P8             28W /  250W |     646MiB /  12288MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060       WDDM |   00000000:0B:00.0  On |                  N/A |
|  0%   35C    P8              8W /  170W |     118MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4436    C+G   ...on\122.0.2365.59\msedgewebview2.exe          N/A  |
|    0   N/A  N/A      6612    C+G   ...on\122.0.2365.59\msedgewebview2.exe          N/A  |
|    0   N/A  N/A      6792    C+G   C:\Windows\explorer.exe                         N/A  |
|    0   N/A  N/A      8600    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe          N/A  |
|    0   N/A  N/A      9404    C+G   ...2txyewy\StartMenuExperienceHost.exe          N/A  |

(base) C:\newpdev\ollama>NVCC -V
NVCC: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:30:42_Pacific_Standard_Time_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

patrickdeluca avatar Mar 03 '24 07:03 patrickdeluca

time=2024-04-01T08:12:19.872+08:00 level=INFO source=gpu.go:115 msg="Detecting GPU type" time=2024-04-01T08:12:19.872+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library cudart64_*.dll" time=2024-04-01T08:12:19.881+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [C:\Users\Administrator\AppData\Local\Programs\Ollama\cudart64_110.dll c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudart64_110.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudart64_110.dll]" time=2024-04-01T08:12:19.940+08:00 level=INFO source=gpu.go:120 msg="Nvidia GPU detected via cudart" time=2024-04-01T08:12:19.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-01T08:12:20.082+08:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6" time=2024-04-01T08:12:20.082+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-01T08:12:20.083+08:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6" time=2024-04-01T08:12:20.083+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-01T08:12:20.083+08:00 level=INFO source=assets.go:108 msg="Updating PATH to C:\Users\ADMINI~1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\libnvvp;C:\Program Files (x86)\jdk/bin;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\bin;D:\WindowsVSC\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64\;C:\Program Files\PlasticSCM5\server;C:\Program Files\PlasticSCM5\client;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;D:\work\apache-tomcat-9.0.1-windows-x64\apache-tomcat-9.0.1\bin\;D:\work\apache-maven-3.8.8-bin\apache-maven-3.8.8\bin\;D:\work\gradle-8.2.1-all\gradle-8.2.1\bin;D:\work\apache-jmeter-5.5\bin;D:\work\w64devkit-1.19.0\w64devkit\bin;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files\MySQL\MySQL Server 8.0\bin;D:\Git\cmd;D:\python\;D:\nvm;C:\Program Files\nodejs;D:\work\visualvm_216\bin;D:\HashiCorp\Vagrant\bin;D:\weixin\微信web开发者工具\dll;D:\work\netcat-win32-1.12;D:\work\VMware-ovftool-4.5.0-20459872-win.x86_64\ovftool;D:\work\lu;D:\work\kotlin-compiler-1.9.22\kotlinc\bin;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA Corporation\Nsight Compute 2020.3.0\;D:\miniconda3;D:\miniconda3\Library\mingw-w64\bin;D:\miniconda3\Library\usr\bin;D:\miniconda3\Library\bin;D:\miniconda3\Scripts;C:\Program Files\MySQL\MySQL Shell 8.0\bin\;C:\Users\Administrator\AppData\Local\Microsoft\WindowsApps;C:\Users\Administrator\AppData\Roaming\npm;D:\nvm;C:\Program Files\nodejs;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\bin\;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\jre\bin\;C:\Users\Administrator\AppData\Local\GitHubDesktop\bin;C:\Users\Administrator\.dotnet\tools;D:\work\mongosh\;;C:\Users\Administrator\AppData\Local\Programs\Ollama" loading library C:\Users\ADMINI~1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3\ext_server.dll time=2024-04-01T08:12:20.099+08:00 level=INFO source=dyn_ext_server.go:87 msg="Loading Dynamic llm server: C:\Users\ADMINI~1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3\ext_server.dll" time=2024-04-01T08:12:20.100+08:00 level=INFO source=dyn_ext_server.go:147 msg="Initializing llama server" llama_model_loader: loaded meta data with 24 key-value pairs and 254 
tensors from D:\ollama\blobs\sha256-456402914e838a953e0cf80caa6adbe75383d9e63584a964f504a7bbb8f7aad9 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = gemma llama_model_loader: - kv 1: general.name str = gemma-7b-it llama_model_loader: - kv 2: gemma.context_length u32 = 8192 llama_model_loader: - kv 3: gemma.embedding_length u32 = 3072 llama_model_loader: - kv 4: gemma.block_count u32 = 28 llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 24576 llama_model_loader: - kv 6: gemma.attention.head_count u32 = 16 llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 16 llama_model_loader: - kv 8: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 9: gemma.attention.key_length u32 = 256 llama_model_loader: - kv 10: gemma.attention.value_length u32 = 256 llama_model_loader: - kv 11: tokenizer.ggml.model str = llama llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ... llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... llama_model_loader: - kv 22: general.quantization_version u32 = 2 llama_model_loader: - kv 23: general.file_type u32 = 2 llama_model_loader: - type f32: 57 tensors llama_model_loader: - type q4_0: 196 tensors llama_model_loader: - type q8_0: 1 tensors llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 256000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 3072 llm_load_print_meta: n_head = 16 llm_load_print_meta: n_head_kv = 16 llm_load_print_meta: n_layer = 28 llm_load_print_meta: n_rot = 192 llm_load_print_meta: n_embd_head_k = 256 llm_load_print_meta: n_embd_head_v = 256 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 4096 llm_load_print_meta: n_embd_v_gqa = 4096 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 24576 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.54 B llm_load_print_meta: model size = 4.84 GiB (4.87 BPW) llm_load_print_meta: general.name = gemma-7b-it llm_load_print_meta: BOS token = 2 '' llm_load_print_meta: EOS token = 1 '' llm_load_print_meta: UNK token = 3 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 227 '<0x0A>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.19 MiB llm_load_tensors: offloading 11 repeating layers to GPU llm_load_tensors: offloaded 11/29 layers to GPU llm_load_tensors: CPU buffer size = 4955.54 MiB llm_load_tensors: CUDA0 buffer size = 1633.76 MiB ........................................................................... llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA_Host KV buffer size = 544.00 MiB llama_kv_cache_init: CUDA0 KV buffer size = 352.00 MiB llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 506.00 MiB llama_new_context_with_model: CUDA0 compute buffer size = 1302.88 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 20.00 MiB llama_new_context_with_model: graph nodes = 957 llama_new_context_with_model: graph splits = 191 CUDA error: CUBLAS_STATUS_ALLOC_FAILED current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:659 cublasCreate_v2(&cublas_handles[device]) GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:193: !"CUDA error"

nanshaws avatar Apr 01 '24 00:04 nanshaws

I would suggest giving the latest release a try to see if that improves the situation. That said, these may ultimately be due to #4599, which I'm still working on.

dhiltgen avatar Jun 01 '24 20:06 dhiltgen

Please upgrade to the latest version (0.1.45); this should now be resolved for CUDA cards.
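If you want to confirm which version the running server actually picked up after upgrading, you can ask it over the API; a minimal sketch, assuming the default local address (adjust if you have changed OLLAMA_HOST):

import requests

# Print the version reported by the running Ollama server (default local address assumed).
print(requests.get("http://127.0.0.1:11434/api/version").json()["version"])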

dhiltgen avatar Jun 22 '24 00:06 dhiltgen