
CUDA out of memory error on Windows when ollama run starts up

Open boluny opened this issue 1 year ago • 5 comments

Hi there,

I just installed ollama 0.1.27 and tried to run gemma:2b, but it reports a CUDA out of memory error. Could you please investigate and figure out the root cause?

I'm using an i7-4700HQ CPU with 16 GB of RAM.

Attached are the log and the nvidia-smi report:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.41                 Driver Version: 531.41       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf          Pwr:Usage/Cap  |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 960M       WDDM | 00000000:02:00.0 Off |                  N/A |
| N/A    0C    P0              N/A /  N/A |    181MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       272    C+G   ...s (x86)\Mozilla Firefox\firefox.exe        N/A  |
|    0   N/A  N/A      4520    C+G   ....0_x64__8wekyb3d8bbwe\YourPhone.exe        N/A  |
|    0   N/A  N/A      7580    C+G   ....Experiences.TextInput.InputApp.exe        N/A  |
|    0   N/A  N/A      9940    C+G   ...2txyewy\StartMenuExperienceHost.exe        N/A  |
|    0   N/A  N/A     11012    C+G   ...t.LockApp_cw5n1h2txyewy\LockApp.exe        N/A  |
|    0   N/A  N/A     12428    C+G   ...cal\Microsoft\OneDrive\OneDrive.exe        N/A  |
|    0   N/A  N/A     13100    C+G   ...s (x86)\Mozilla Firefox\firefox.exe        N/A  |
|    0   N/A  N/A     13332    C+G   ...guoyun\bin-7.1.3\NutstoreClient.exe        N/A  |
+---------------------------------------------------------------------------------------+

log:

[GIN] 2024/02/29 - 23:47:32 | 200 | 32.7µs | 127.0.0.1 | HEAD "/" [GIN] 2024/02/29 - 23:47:32 | 200 | 1.2447ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/02/29 - 23:47:32 | 200 | 2.4218ms | 127.0.0.1 | POST "/api/show" time=2024-02-29T23:47:37.171+08:00 level=INFO source=gpu.go:94 msg="Detecting GPU type" time=2024-02-29T23:47:37.171+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll" time=2024-02-29T23:47:37.216+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll C:\WINDOWS\system32\nvml.dll]" time=2024-02-29T23:47:37.236+08:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected" time=2024-02-29T23:47:37.236+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-02-29T23:47:37.248+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.0" time=2024-02-29T23:47:37.248+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-02-29T23:47:37.252+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.0" time=2024-02-29T23:47:37.253+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-02-29T23:47:37.253+08:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to time=2024-02-29T23:47:37.328+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\bolun\AppData\Local\Temp\ollama625311207\cuda_v11.3\ext_server.dll" time=2024-02-29T23:47:37.329+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce GTX 960M, compute capability 5.0, VMM: yes llama_model_loader: loaded meta data with 21 key-value pairs and 164 tensors from C:\Users\bolun.ollama\models\blobs\sha256-c1864a5eb19305c40519da12cc543519e48a0697ecd30e15d5ac228644957d12 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = gemma llama_model_loader: - kv 1: general.name str = gemma-2b-it llama_model_loader: - kv 2: gemma.context_length u32 = 8192 llama_model_loader: - kv 3: gemma.block_count u32 = 18 llama_model_loader: - kv 4: gemma.embedding_length u32 = 2048 llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 16384 llama_model_loader: - kv 6: gemma.attention.head_count u32 = 8 llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 1 llama_model_loader: - kv 8: gemma.attention.key_length u32 = 256 llama_model_loader: - kv 9: gemma.attention.value_length u32 = 256 llama_model_loader: - kv 10: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 11: tokenizer.ggml.model str = llama llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 14: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 15: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,256128] = ["", "", "", "", ... llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,256128] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,256128] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ... 
llama_model_loader: - kv 19: general.quantization_version u32 = 2 llama_model_loader: - kv 20: general.file_type u32 = 2 llama_model_loader: - type f32: 37 tensors llama_model_loader: - type q4_0: 126 tensors llama_model_loader: - type q8_0: 1 tensors llm_load_vocab: mismatch in special tokens definition ( 544/256128 vs 388/256128 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 256128 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 2048 llm_load_print_meta: n_head = 8 llm_load_print_meta: n_head_kv = 1 llm_load_print_meta: n_layer = 18 llm_load_print_meta: n_rot = 256 llm_load_print_meta: n_embd_head_k = 256 llm_load_print_meta: n_embd_head_v = 256 llm_load_print_meta: n_gqa = 8 llm_load_print_meta: n_embd_k_gqa = 256 llm_load_print_meta: n_embd_v_gqa = 256 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 16384 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 2B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 2.51 B llm_load_print_meta: model size = 1.56 GiB (5.34 BPW) llm_load_print_meta: general.name = gemma-2b-it llm_load_print_meta: BOS token = 2 '' llm_load_print_meta: EOS token = 1 '' llm_load_print_meta: UNK token = 3 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 227 '<0x0A>' llm_load_tensors: ggml ctx size = 0.13 MiB llm_load_tensors: offloading 18 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 19/19 layers to GPU llm_load_tensors: CPU buffer size = 531.52 MiB llm_load_tensors: CUDA0 buffer size = 1594.93 MiB ..................................................... llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 36.00 MiB llama_new_context_with_model: KV self size = 36.00 MiB, K (f16): 18.00 MiB, V (f16): 18.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 9.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 504.25 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 4.00 MiB llama_new_context_with_model: graph splits (measure): 3 CUDA error: out of memory current device: 0, in function ggml_cuda_pool_malloc_vmm at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:7990 cuMemSetAccess(g_cuda_pool_addr[device] + g_cuda_pool_size[device], reserve_size, &access, 1) GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"

boluny avatar Feb 29 '24 16:02 boluny

cc @dhiltgen

pdevine avatar Mar 01 '24 01:03 pdevine

I am also experiencing the same error. Here is the error log: time=2024-03-02T10:57:13.946+08:00 level=INFO source=images.go:710 msg="total blobs: 17" time=2024-03-02T10:57:13.959+08:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0" time=2024-03-02T10:57:13.961+08:00 level=INFO source=routes.go:1019 msg="Listening on [::]:1123 (version 0.1.27)" time=2024-03-02T10:57:13.961+08:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..." time=2024-03-02T10:57:14.141+08:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx2 cpu cuda_v11.3 cpu_avx]" [GIN] 2024/03/02 - 10:57:14 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/03/02 - 10:57:14 | 200 | 2.3633ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/03/02 - 10:57:14 | 200 | 2.3207ms | 127.0.0.1 | POST "/api/show" time=2024-03-02T10:57:14.905+08:00 level=INFO source=gpu.go:94 msg="Detecting GPU type" time=2024-03-02T10:57:14.905+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll" time=2024-03-02T10:57:14.909+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]" time=2024-03-02T10:57:14.926+08:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected" time=2024-03-02T10:57:14.926+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:14.941+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6" time=2024-03-02T10:57:14.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:14.941+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6" time=2024-03-02T10:57:14.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:14.941+08:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\huan\AppData\Local\Temp\ollama2241795987\cuda_v11.3;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Users\huan\AppData\Local\Microsoft\WindowsApps;;C:\Users\huan\AppData\Local\Programs\Ollama" time=2024-03-02T10:57:15.046+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\huan\AppData\Local\Temp\ollama2241795987\cuda_v11.3\ext_server.dll" time=2024-03-02T10:57:15.047+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from C:\Users\huan.ollama\models\blobs\sha256-8934d96d3f08982e95922b2b7a2c626a1fe873d7c3b06e8e56d7bc0a1fef9246 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = LLaMA v2 llama_model_loader: - kv 2: llama.context_length u32 = 4096 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: tokenizer.ggml.model str = llama llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... llama_model_loader: - kv 22: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 4096 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 32 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 4096 llm_load_print_meta: n_embd_v_gqa = 4096 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 11008 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 4096 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 6.74 B llm_load_print_meta: model size = 3.56 GiB (4.54 BPW) llm_load_print_meta: general.name = LLaMA v2 llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.22 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 70.31 MiB llm_load_tensors: CUDA0 buffer size = 3577.56 MiB .................................................................................................. 
llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 1024.00 MiB llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB llama_new_context_with_model: graph splits (measure): 3 time=2024-03-02T10:57:27.527+08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop" [GIN] 2024/03/02 - 10:57:27 | 200 | 13.1901436s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/03/02 - 10:57:42 | 200 | 453.2522ms | 127.0.0.1 | POST "/api/chat" time=2024-03-02T10:57:56.201+08:00 level=INFO source=routes.go:78 msg="changing loaded model" time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:58.660+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6" time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:58.660+08:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 8.6" time=2024-03-02T10:57:58.660+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T10:57:58.660+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\huan\AppData\Local\Temp\ollama2241795987\cuda_v11.3\ext_server.dll" time=2024-03-02T10:57:58.660+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" llama_model_loader: loaded meta data with 24 key-value pairs and 254 tensors from C:\Users\huan.ollama\models\blobs\sha256-456402914e838a953e0cf80caa6adbe75383d9e63584a964f504a7bbb8f7aad9 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = gemma llama_model_loader: - kv 1: general.name str = gemma-7b-it llama_model_loader: - kv 2: gemma.context_length u32 = 8192 llama_model_loader: - kv 3: gemma.embedding_length u32 = 3072 llama_model_loader: - kv 4: gemma.block_count u32 = 28 llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 24576 llama_model_loader: - kv 6: gemma.attention.head_count u32 = 16 llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 16 llama_model_loader: - kv 8: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 9: gemma.attention.key_length u32 = 256 llama_model_loader: - kv 10: gemma.attention.value_length u32 = 256 llama_model_loader: - kv 11: tokenizer.ggml.model str = llama llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ... llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ... 
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... llama_model_loader: - kv 22: general.quantization_version u32 = 2 llama_model_loader: - kv 23: general.file_type u32 = 2 llama_model_loader: - type f32: 57 tensors llama_model_loader: - type q4_0: 196 tensors llama_model_loader: - type q8_0: 1 tensors llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 256000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 3072 llm_load_print_meta: n_head = 16 llm_load_print_meta: n_head_kv = 16 llm_load_print_meta: n_layer = 28 llm_load_print_meta: n_rot = 192 llm_load_print_meta: n_embd_head_k = 256 llm_load_print_meta: n_embd_head_v = 256 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 4096 llm_load_print_meta: n_embd_v_gqa = 4096 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 24576 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.54 B llm_load_print_meta: model size = 4.84 GiB (4.87 BPW) llm_load_print_meta: general.name = gemma-7b-it llm_load_print_meta: BOS token = 2 '' llm_load_print_meta: EOS token = 1 '' llm_load_print_meta: UNK token = 3 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 227 '<0x0A>' llm_load_tensors: ggml ctx size = 0.19 MiB llm_load_tensors: offloading 28 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 29/29 layers to GPU llm_load_tensors: CPU buffer size = 796.88 MiB llm_load_tensors: CUDA0 buffer size = 4955.54 MiB ........................................................................... 
llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3070 Ti, compute capability 8.6, VMM: yes llama_kv_cache_init: CUDA0 KV buffer size = 896.00 MiB llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 11.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 506.00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 6.00 MiB llama_new_context_with_model: graph splits (measure): 3 time=2024-03-02T10:58:13.011+08:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop" CUDA error: out of memory current device: 0, in function ggml_cuda_pool_malloc_vmm at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:7990 cuMemSetAccess(g_cuda_pool_addr[device] + g_cuda_pool_size[device], reserve_size, &access, 1) GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"

hbqclh avatar Mar 02 '24 03:03 hbqclh

My spec: Ubuntu 22.04, 16.0 GiB RAM, GeForce 940MX (2048 MiB).

Using gemma:2b, I face the same CUDA OOM error when calling the Ollama API from my web app. I do NOT face the same error via ollama run.

So I got Ollama working with Gemma from my web app by:

  • putting NO user instruction in the system message of the API payload (or maybe removing the system message entirely; I have not tried that),
  • starting the conversation with "Hello"; any longer question like "Why is the sky blue?" would not work,
  • after that, I can ask any long question.

Strange, but it worked for me (a sketch of the request shape follows below). Note that I do not face a similar issue with other LLMs like Mistral.
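This is only a sketch, not my actual web app code; it assumes the default local server address http://127.0.0.1:11434 and the standard /api/chat endpoint, so adjust those if your setup differs.

import requests

# Sketch only: default local Ollama address; change if your server listens elsewhere.
OLLAMA_CHAT_URL = "http://127.0.0.1:11434/api/chat"

def chat(messages):
    # stream=False asks the server for a single JSON response instead of a token stream.
    resp = requests.post(OLLAMA_CHAT_URL, json={
        "model": "gemma:2b",
        "messages": messages,   # note: no "system" message at all
        "stream": False,
    })
    resp.raise_for_status()
    return resp.json()["message"]["content"]

# Start with a short greeting as the very first turn.
history = [{"role": "user", "content": "Hello"}]
reply = chat(history)
print(reply)

# After that short first exchange, longer questions go through for me.
history += [
    {"role": "assistant", "content": reply},
    {"role": "user", "content": "Why is the sky blue?"},
]
print(chat(history))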

BTW, this is my first ever GitHub comment. Many, many thanks to the great Ollama team!

trandhbao avatar Mar 02 '24 12:03 trandhbao

Similar issue here, with a Ryzen 5700X, 32 GB RAM, and dual GPUs:

time=2024-03-02T23:09:06.654-05:00 level=INFO source=images.go:710 msg="total blobs: 0" time=2024-03-02T23:09:06.655-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0" time=2024-03-02T23:09:06.655-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)" time=2024-03-02T23:09:06.655-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..." time=2024-03-02T23:09:06.811-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cuda_v11.3 cpu_avx cpu_avx2 cpu]" [GIN] 2024/03/02 - 23:09:23 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/03/02 - 23:09:23 | 404 | 528.3µs | 127.0.0.1 | POST "/api/show" [GIN] 2024/03/02 - 23:09:24 | 200 | 492.9968ms | 127.0.0.1 | POST "/api/pull" [GIN] 2024/03/02 - 23:09:27 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/03/02 - 23:09:27 | 404 | 0s | 127.0.0.1 | POST "/api/show" time=2024-03-02T23:09:28.622-05:00 level=INFO source=download.go:136 msg="downloading e8a35b5937a5 in 42 100 MB part(s)" time=2024-03-02T23:10:36.482-05:00 level=INFO source=download.go:136 msg="downloading 43070e2d4e53 in 1 11 KB part(s)" time=2024-03-02T23:10:38.340-05:00 level=INFO source=download.go:136 msg="downloading e6836092461f in 1 42 B part(s)" time=2024-03-02T23:10:41.345-05:00 level=INFO source=download.go:136 msg="downloading ed11eda7790d in 1 30 B part(s)" time=2024-03-02T23:10:43.244-05:00 level=INFO source=download.go:136 msg="downloading f9b1e3196ecf in 1 483 B part(s)" [GIN] 2024/03/02 - 23:10:47 | 200 | 1m20s | 127.0.0.1 | POST "/api/pull" [GIN] 2024/03/02 - 23:10:47 | 200 | 524.1µs | 127.0.0.1 | POST "/api/show" [GIN] 2024/03/02 - 23:10:47 | 200 | 528.6µs | 127.0.0.1 | POST "/api/show" time=2024-03-02T23:10:47.816-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type" time=2024-03-02T23:10:47.816-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll" time=2024-03-02T23:10:47.840-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]" time=2024-03-02T23:10:47.855-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected" time=2024-03-02T23:10:47.859-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T23:10:47.887-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-02T23:10:47.887-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T23:10:47.887-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-02T23:10:47.887-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-02T23:10:47.887-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\RUCARA~1\AppData\Local\Temp\ollama991450673\cuda_v11.3;C:\Users\rucaradio\AppData\Local\Programs\Ollama;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program 
Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Users\rucaradio\.cargo\bin;C:\Users\rucaradio\scoop\shims;C:\Users\rucaradio\AppData\Local\Microsoft\WindowsApps;C:\Users\rucaradio\AppData\Local\Programs\Microsoft VS Code\bin;C:\ " time=2024-03-02T23:10:48.341-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\RUCARA~1\AppData\Local\Temp\ollama991450673\cuda_v11.3\ext_server.dll" time=2024-03-02T23:10:48.342-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes Device 1: Quadro M6000, compute capability 5.2, VMM: yes llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e... llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 3.83 GiB (4.54 BPW) llm_load_print_meta: general.name = mistralai llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.33 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 70.31 MiB llm_load_tensors: CUDA0 buffer size = 1989.53 MiB llm_load_tensors: CUDA1 buffer size = 1858.02 MiB .................................................................................................. llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB llama_new_context_with_model: graph splits (measure): 5 CUDA error: unspecified launch failure current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953 cudaDeviceSynchronize() GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error" time=2024-03-03T01:54:03.879-05:00 level=INFO source=images.go:710 msg="total blobs: 5" time=2024-03-03T01:54:03.884-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0" time=2024-03-03T01:54:03.885-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)" time=2024-03-03T01:54:03.885-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..." 
time=2024-03-03T01:54:04.032-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11.3 cpu]" [GIN] 2024/03/03 - 01:54:04 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/03/03 - 01:54:04 | 200 | 14.8896ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/03/03 - 01:54:04 | 200 | 505.4µs | 127.0.0.1 | POST "/api/show" time=2024-03-03T01:54:04.874-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type" time=2024-03-03T01:54:04.874-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll" time=2024-03-03T01:54:04.889-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]" time=2024-03-03T01:54:04.895-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected" time=2024-03-03T01:54:04.907-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:54:04.942-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-03T01:54:04.942-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:54:04.942-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-03T01:54:04.942-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:54:04.942-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\RUCARA~1\AppData\Local\Temp\ollama2071667329\cuda_v11.3;C:\Users\rucaradio\AppData\Local\Programs\Ollama;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Users\rucaradio\.cargo\bin;C:\Users\rucaradio\scoop\shims;C:\Users\rucaradio\AppData\Local\Microsoft\WindowsApps;C:\Users\rucaradio\AppData\Local\Programs\Microsoft VS Code\bin;C:\ " time=2024-03-03T01:54:05.435-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\RUCARA~1\AppData\Local\Temp\ollama2071667329\cuda_v11.3\ext_server.dll" time=2024-03-03T01:54:05.435-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes Device 1: Quadro M6000, compute capability 5.2, VMM: yes llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e... llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 3.83 GiB (4.54 BPW) llm_load_print_meta: general.name = mistralai llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.33 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 70.31 MiB llm_load_tensors: CUDA0 buffer size = 1989.53 MiB llm_load_tensors: CUDA1 buffer size = 1858.02 MiB .................................................................................................. llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB llama_new_context_with_model: graph splits (measure): 5 CUDA error: unspecified launch failure current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953 cudaDeviceSynchronize() GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error" time=2024-03-03T01:55:31.714-05:00 level=INFO source=images.go:710 msg="total blobs: 5" time=2024-03-03T01:55:31.720-05:00 level=INFO source=images.go:717 msg="total unused blobs removed: 0" time=2024-03-03T01:55:31.722-05:00 level=INFO source=routes.go:1019 msg="Listening on 127.0.0.1:11434 (version 0.1.27)" time=2024-03-03T01:55:31.722-05:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..." 
time=2024-03-03T01:55:31.881-05:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu cpu_avx2 cuda_v11.3 cpu_avx]" [GIN] 2024/03/03 - 01:59:06 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/03/03 - 01:59:06 | 200 | 18.2741ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/03/03 - 01:59:06 | 200 | 549.2µs | 127.0.0.1 | POST "/api/show" time=2024-03-03T01:59:07.250-05:00 level=INFO source=gpu.go:94 msg="Detecting GPU type" time=2024-03-03T01:59:07.250-05:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library nvml.dll" time=2024-03-03T01:59:07.275-05:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [c:\Windows\System32\nvml.dll C:\Windows\system32\nvml.dll]" time=2024-03-03T01:59:07.300-05:00 level=INFO source=gpu.go:99 msg="Nvidia GPU detected" time=2024-03-03T01:59:07.302-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:59:07.332-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-03T01:59:07.332-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:59:07.332-05:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 5.2" time=2024-03-03T01:59:07.332-05:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-03-03T01:59:07.332-05:00 level=INFO source=dyn_ext_server.go:385 msg="Updating PATH to C:\Users\RUCARA~1\AppData\Local\Temp\ollama4153122201\cuda_v11.3;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA\CUDNN\v9.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;;;;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\Program Files\dotnet\;C:\Program Files\Git\cmd;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\Program Files\WindowsPowerShell\Scripts;C:\ProgramData\chocolatey\bin;;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Program Files\NVIDIA Corporation\Nsight Compute 2023.3.1\;C:\Program Files\Go\bin;C:\Program Files\PowerShell\7\;C:\Program Files\CMake\bin;C:\Users\rucaradio\.cargo\bin;C:\Users\rucaradio\scoop\shims;C:\Users\rucaradio\AppData\Local\Microsoft\WindowsApps;C:\Users\rucaradio\AppData\Local\Programs\Microsoft VS Code\bin;C:\Users\rucaradio\AppData\Roaming\nvm;C:\Program Files\nodejs;C:\;C:\Users\rucaradio\AppData\Local\Programs\Ollama" time=2024-03-03T01:59:07.843-05:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\Users\RUCARA~1\AppData\Local\Temp\ollama4153122201\cuda_v11.3\ext_server.dll" time=2024-03-03T01:59:07.843-05:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server" 
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes Device 1: Quadro M6000, compute capability 5.2, VMM: yes llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from C:\Users\rucaradio.ollama\models\blobs\sha256-e8a35b5937a5e6d5c35d1f2a15f161e07eefe5e5bb0a3cdd42998ee79b057730 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e... llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 3.83 GiB (4.54 BPW) llm_load_print_meta: general.name = mistralai llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.33 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CPU buffer size = 70.31 MiB llm_load_tensors: CUDA0 buffer size = 1989.53 MiB llm_load_tensors: CUDA1 buffer size = 1858.02 MiB .................................................................................................. llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 136.00 MiB llama_kv_cache_init: CUDA1 KV buffer size = 120.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA1 compute buffer size = 164.01 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.00 MiB llama_new_context_with_model: graph splits (measure): 5 CUDA error: unspecified launch failure current device: 1, in function ggml_backend_cuda_buffer_cpy_tensor at C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:10953 cudaDeviceSynchronize() GGML_ASSERT: C:\Users\jeff\git\ollama\llm\llama.cpp\ggml-cuda.cu:243: !"CUDA error"

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.61                 Driver Version: 551.61         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf          Pwr:Usage/Cap  |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro M6000                  WDDM |   00000000:05:00.0  On |                  Off |
| 27%   51C    P8             28W /  250W |     646MiB /  12288MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060       WDDM |   00000000:0B:00.0  On |                  N/A |
|  0%   35C    P8              8W /  170W |     118MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4436    C+G   ...on\122.0.2365.59\msedgewebview2.exe          N/A  |
|    0   N/A  N/A      6612    C+G   ...on\122.0.2365.59\msedgewebview2.exe          N/A  |
|    0   N/A  N/A      6792    C+G   C:\Windows\explorer.exe                         N/A  |
|    0   N/A  N/A      8600    C+G   ...CBS_cw5n1h2txyewy\TextInputHost.exe          N/A  |
|    0   N/A  N/A      9404    C+G   ...2txyewy\StartMenuExperienceHost.exe          N/A  |

(base) C:\newpdev\ollama>NVCC -V
NVCC: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:30:42_Pacific_Standard_Time_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

patrickdeluca avatar Mar 03 '24 07:03 patrickdeluca

time=2024-04-01T08:12:19.872+08:00 level=INFO source=gpu.go:115 msg="Detecting GPU type" time=2024-04-01T08:12:19.872+08:00 level=INFO source=gpu.go:265 msg="Searching for GPU management library cudart64_*.dll" time=2024-04-01T08:12:19.881+08:00 level=INFO source=gpu.go:311 msg="Discovered GPU libraries: [C:\Users\Administrator\AppData\Local\Programs\Ollama\cudart64_110.dll c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudart64_110.dll C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\cudart64_110.dll]" time=2024-04-01T08:12:19.940+08:00 level=INFO source=gpu.go:120 msg="Nvidia GPU detected via cudart" time=2024-04-01T08:12:19.941+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-01T08:12:20.082+08:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6" time=2024-04-01T08:12:20.082+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-01T08:12:20.083+08:00 level=INFO source=gpu.go:188 msg="[cudart] CUDART CUDA Compute Capability detected: 8.6" time=2024-04-01T08:12:20.083+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2" time=2024-04-01T08:12:20.083+08:00 level=INFO source=assets.go:108 msg="Updating PATH to C:\Users\ADMINI~1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\libnvvp;C:\Program Files (x86)\jdk/bin;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\bin;D:\WindowsVSC\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64\;C:\Program Files\PlasticSCM5\server;C:\Program Files\PlasticSCM5\client;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;D:\work\apache-tomcat-9.0.1-windows-x64\apache-tomcat-9.0.1\bin\;D:\work\apache-maven-3.8.8-bin\apache-maven-3.8.8\bin\;D:\work\gradle-8.2.1-all\gradle-8.2.1\bin;D:\work\apache-jmeter-5.5\bin;D:\work\w64devkit-1.19.0\w64devkit\bin;C:\Program Files\Docker\Docker\resources\bin;C:\Program Files\MySQL\MySQL Server 8.0\bin;D:\Git\cmd;D:\python\;D:\nvm;C:\Program Files\nodejs;D:\work\visualvm_216\bin;D:\HashiCorp\Vagrant\bin;D:\weixin\微信web开发者工具\dll;D:\work\netcat-win32-1.12;D:\work\VMware-ovftool-4.5.0-20459872-win.x86_64\ovftool;D:\work\lu;D:\work\kotlin-compiler-1.9.22\kotlinc\bin;C:\Program Files\CMake\bin;C:\Program Files\NVIDIA Corporation\Nsight Compute 2020.3.0\;D:\miniconda3;D:\miniconda3\Library\mingw-w64\bin;D:\miniconda3\Library\usr\bin;D:\miniconda3\Library\bin;D:\miniconda3\Scripts;C:\Program Files\MySQL\MySQL Shell 8.0\bin\;C:\Users\Administrator\AppData\Local\Microsoft\WindowsApps;C:\Users\Administrator\AppData\Roaming\npm;D:\nvm;C:\Program Files\nodejs;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\bin\;D:\work\graalvm-jdk-17_windows-x64_bin\graalvm-jdk-17.0.9+11.1\jre\bin\;C:\Users\Administrator\AppData\Local\GitHubDesktop\bin;C:\Users\Administrator\.dotnet\tools;D:\work\mongosh\;;C:\Users\Administrator\AppData\Local\Programs\Ollama" loading library C:\Users\ADMINI~1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3\ext_server.dll time=2024-04-01T08:12:20.099+08:00 level=INFO source=dyn_ext_server.go:87 msg="Loading Dynamic llm server: C:\Users\ADMINI~1\AppData\Local\Temp\ollama2476353147\runners\cuda_v11.3\ext_server.dll" time=2024-04-01T08:12:20.100+08:00 level=INFO source=dyn_ext_server.go:147 msg="Initializing llama server" llama_model_loader: loaded meta data with 24 key-value pairs and 254 
tensors from D:\ollama\blobs\sha256-456402914e838a953e0cf80caa6adbe75383d9e63584a964f504a7bbb8f7aad9 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = gemma llama_model_loader: - kv 1: general.name str = gemma-7b-it llama_model_loader: - kv 2: gemma.context_length u32 = 8192 llama_model_loader: - kv 3: gemma.embedding_length u32 = 3072 llama_model_loader: - kv 4: gemma.block_count u32 = 28 llama_model_loader: - kv 5: gemma.feed_forward_length u32 = 24576 llama_model_loader: - kv 6: gemma.attention.head_count u32 = 16 llama_model_loader: - kv 7: gemma.attention.head_count_kv u32 = 16 llama_model_loader: - kv 8: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 9: gemma.attention.key_length u32 = 256 llama_model_loader: - kv 10: gemma.attention.value_length u32 = 256 llama_model_loader: - kv 11: tokenizer.ggml.model str = llama llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ... llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,256000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 2 llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 3 llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 21: tokenizer.chat_template str = {% if messages[0]['role'] == 'system'... llama_model_loader: - kv 22: general.quantization_version u32 = 2 llama_model_loader: - kv 23: general.file_type u32 = 2 llama_model_loader: - type f32: 57 tensors llama_model_loader: - type q4_0: 196 tensors llama_model_loader: - type q8_0: 1 tensors llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = gemma llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 256000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 3072 llm_load_print_meta: n_head = 16 llm_load_print_meta: n_head_kv = 16 llm_load_print_meta: n_layer = 28 llm_load_print_meta: n_rot = 192 llm_load_print_meta: n_embd_head_k = 256 llm_load_print_meta: n_embd_head_v = 256 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 4096 llm_load_print_meta: n_embd_v_gqa = 4096 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 24576 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.54 B llm_load_print_meta: model size = 4.84 GiB (4.87 BPW) llm_load_print_meta: general.name = gemma-7b-it llm_load_print_meta: BOS token = 2 '' llm_load_print_meta: EOS token = 1 '' llm_load_print_meta: UNK token = 3 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 227 '<0x0A>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.19 MiB llm_load_tensors: offloading 11 repeating layers to GPU llm_load_tensors: offloaded 11/29 layers to GPU llm_load_tensors: CPU buffer size = 4955.54 MiB llm_load_tensors: CUDA0 buffer size = 1633.76 MiB ........................................................................... llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA_Host KV buffer size = 544.00 MiB llama_kv_cache_init: CUDA0 KV buffer size = 352.00 MiB llama_new_context_with_model: KV self size = 896.00 MiB, K (f16): 448.00 MiB, V (f16): 448.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 506.00 MiB llama_new_context_with_model: CUDA0 compute buffer size = 1302.88 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 20.00 MiB llama_new_context_with_model: graph nodes = 957 llama_new_context_with_model: graph splits = 191 CUDA error: CUBLAS_STATUS_ALLOC_FAILED current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:659 cublasCreate_v2(&cublas_handles[device]) GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:193: !"CUDA error"

nanshaws avatar Apr 01 '24 00:04 nanshaws

I would suggest giving the latest release a try to see if that improves the situation. That said, these may ultimately be due to #4599, which I'm still working on.

dhiltgen avatar Jun 01 '24 20:06 dhiltgen

Please upgrade to the latest version (0.1.45); this should now be resolved for CUDA cards.
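If you want to confirm which version the running server actually picked up after upgrading, you can ask it over the API; a minimal sketch, assuming the default local address (adjust if you have changed OLLAMA_HOST):

import requests

# Print the version reported by the running Ollama server (default local address assumed).
print(requests.get("http://127.0.0.1:11434/api/version").json()["version"])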

dhiltgen avatar Jun 22 '24 00:06 dhiltgen