Running out of memory when allocating to second GPU
What is the issue?
No issues with any model that fits on a single 3090, but it seems to run out of memory when the model has to be split across to the second 3090.
INFO [wmain] starting c++ runner | tid="33768" timestamp=1729324300
INFO [wmain] build info | build=3670 commit="aad7f071" tid="33768" timestamp=1729324300
INFO [wmain] system info | n_threads=20 n_threads_batch=20 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="33768" timestamp=1729324300 total_threads=28
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="27" port="56651" tid="33768" timestamp=1729324300
llama_model_loader: loaded meta data with 41 key-value pairs and 724 tensors from C:\Users\Joshua\.ollama\models\blobs\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct
llama_model_loader: - kv 3: general.organization str = Meta Llama
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.license str = llama3.1
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B Instruct
llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 12: general.tags arr[str,3] = ["nvidia", "llama3.1", "text-generati...
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: general.datasets arr[str,1] = ["nvidia/HelpSteer2"]
llama_model_loader: - kv 15: llama.block_count u32 = 80
llama_model_loader: - kv 16: llama.context_length u32 = 131072
llama_model_loader: - kv 17: llama.embedding_length u32 = 8192
llama_model_loader: - kv 18: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 19: llama.attention.head_count u32 = 64
llama_model_loader: - kv 20: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 21: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 22: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 23: llama.attention.key_length u32 = 128
llama_model_loader: - kv 24: llama.attention.value_length u32 = 128
llama_model_loader: - kv 25: general.file_type u32 = 13
llama_model_loader: - kv 26: llama.vocab_size u32 = 128256
llama_model_loader: - kv 27: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 29: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 35: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 36: general.quantization_version u32 = 2
llama_model_loader: - kv 37: quantize.imatrix.file str = /models_out/Llama-3.1-Nemotron-70B-In...
llama_model_loader: - kv 38: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 39: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 40: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q3_K: 321 tensors
llama_model_loader: - type q5_K: 240 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-10-19T15:51:40.427+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q3_K - Large
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 34.58 GiB (4.21 BPW)
llm_load_print_meta: general.name = Llama 3.1 70B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 1.02 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 430.55 MiB
llm_load_tensors: CUDA0 buffer size = 17507.01 MiB
llm_load_tensors: CUDA1 buffer size = 17474.99 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1312.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model 'C:\Users\Joshua\.ollama\models\blobs\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4'
ERROR [load_model] unable to load model | model="C:\\Users\\Joshua\\.ollama\\models\\blobs\\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4" tid="33768" timestamp=1729324312
time=2024-10-19T15:51:53.175+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2024-10-19T15:51:55.231+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
time=2024-10-19T15:51:55.734+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error:failed to create context with model 'C:\\Users\\Joshua\\.ollama\\models\\blobs\\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4'"
[GIN] 2024/10/19 - 15:51:55 | 500 | 15.6142405s | 127.0.0.1 | POST "/api/generate"
OS: Windows
GPU: Nvidia
CPU: Intel
Ollama version: 0.3.13
Please post the full log, there are earlier log lines about device detection and memory calculations that may be relevant. Also set OLLAMA_DEBUG=1 in the server environment, it may give more context on why the allocation failed.
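For what it's worth, one way to capture a debug log on Windows is to quit the Ollama tray app and start the server from a terminal with the variable set. This is only a sketch; it assumes Git Bash or a similar shell and that ollama is on the PATH:

$ OLLAMA_DEBUG=1 ollama serve

In PowerShell the equivalent would be setting $env:OLLAMA_DEBUG="1" and then running ollama serve; when the server is started via the tray app instead, the log should end up in %LOCALAPPDATA%\Ollama\server.log.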
2024/10/20 06:30:28 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\Joshua\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-10-20T06:30:28.534+08:00 level=INFO source=images.go:754 msg="total blobs: 25"
time=2024-10-20T06:30:28.561+08:00 level=INFO source=images.go:761 msg="total unused blobs removed: 5"
time=2024-10-20T06:30:28.562+08:00 level=INFO source=routes.go:1205 msg="Listening on 127.0.0.1:11434 (version 0.3.13)"
time=2024-10-20T06:30:28.562+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe
time=2024-10-20T06:30:28.562+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe
time=2024-10-20T06:30:28.562+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe
time=2024-10-20T06:30:28.562+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe
time=2024-10-20T06:30:28.562+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe
time=2024-10-20T06:30:28.562+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_v6.1\ollama_llama_server.exe
time=2024-10-20T06:30:28.562+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cuda_v12 rocm_v6.1 cpu cpu_avx cpu_avx2 cuda_v11]"
time=2024-10-20T06:30:28.562+08:00 level=DEBUG source=common.go:50 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-10-20T06:30:28.562+08:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-10-20T06:30:28.562+08:00 level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
time=2024-10-20T06:30:28.562+08:00 level=DEBUG source=gpu.go:86 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-10-20T06:30:28.562+08:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=nvml.dll
time=2024-10-20T06:30:28.562+08:00 level=DEBUG source=gpu.go:491 msg="gpu library search" globs="[C:\\Users\\Joshua\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvml.dll C:\\Program Files\\Microsoft MPI\\Bin\\nvml.dll C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath\\nvml.dll C:\\Windows\\system32\\nvml.dll C:\\Windows\\nvml.dll C:\\Windows\\System32\\Wbem\\nvml.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvml.dll C:\\Windows\\System32\\OpenSSH\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\130\\Tools\\Binn\\nvml.dll C:\\Program Files\\dotnet\\nvml.dll C:\\Program Files (x86)\\Common Files\\Acronis\\SnapAPI\\nvml.dll C:\\Program Files (x86)\\Common Files\\Acronis\\VirtualFile\\nvml.dll C:\\Program Files (x86)\\Common Files\\Acronis\\VirtualFile64\\nvml.dll C:\\Program Files (x86)\\Common Files\\Acronis\\FileProtector\\nvml.dll C:\\Program Files (x86)\\Common Files\\Acronis\\FileProtector64\\nvml.dll C:\\Program Files\\PuTTY\\nvml.dll C:\\Program Files\\Git\\cmd\\nvml.dll C:\\Program Files\\Go\\bin\\nvml.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll C:\\Users\\Joshua\\AppData\\Local\\NVIDIA\\ChatWithRTX\\env_nvd_rag\\Lib\\site-packages\\torch\\lib\\nvml.dll C:\\Program Files\\Common Files\\Autodesk Shared\\nvml.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvml.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvml.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvml.dll C:\\Users\\Joshua\\AppData\\Roaming\\nvm\\nvml.dll C:\\Program Files\\nodejs\\nvml.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python311\\Scripts\\nvml.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python311\\nvml.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python312\\Scripts\\nvml.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python312\\nvml.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python310\\Scripts\\nvml.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python310\\nvml.dll C:\\Program Files\\platform-tools\\nvml.dll C:\\Users\\Joshua\\AppData\\Local\\Microsoft\\WindowsApps\\nvml.dll C:\\Users\\Joshua\\.dotnet\\tools\\nvml.dll C:\\Users\\Joshua\\AppData\\Roaming\\npm\\nvml.dll C:\\Users\\Joshua\\go\\bin\\nvml.dll C:\\Program Files\\heroku\\bin\\nvml.dll C:\\Users\\Joshua\\.fly\\bin\\nvml.dll C:\\Users\\Joshua\\AppData\\Local\\GitHubDesktop\\bin\\nvml.dll C:\\Users\\Joshua\\AppData\\Local\\ffmpegio\\ffmpeg-downloader\\ffmpeg\\bin\\nvml.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Ollama\\nvml.dll C:\\Users\\Joshua\\AppData\\Roaming\\nvm\\nvml.dll C:\\Program Files\\nodejs\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2024-10-20T06:30:28.563+08:00 level=DEBUG source=gpu.go:496 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvml.dll"
time=2024-10-20T06:30:28.563+08:00 level=DEBUG source=gpu.go:525 msg="discovered GPU libraries" paths="[C:\\Windows\\system32\\nvml.dll c:\\Windows\\System32\\nvml.dll]"
time=2024-10-20T06:30:28.576+08:00 level=DEBUG source=gpu.go:107 msg="nvidia-ml loaded" library=C:\Windows\system32\nvml.dll
time=2024-10-20T06:30:28.576+08:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=nvcuda.dll
time=2024-10-20T06:30:28.576+08:00 level=DEBUG source=gpu.go:491 msg="gpu library search" globs="[C:\\Users\\Joshua\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\nvcuda.dll C:\\Program Files\\Microsoft MPI\\Bin\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath\\nvcuda.dll C:\\Windows\\system32\\nvcuda.dll C:\\Windows\\nvcuda.dll C:\\Windows\\System32\\Wbem\\nvcuda.dll C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\nvcuda.dll C:\\Windows\\System32\\OpenSSH\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\130\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\dotnet\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Acronis\\SnapAPI\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Acronis\\VirtualFile\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Acronis\\VirtualFile64\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Acronis\\FileProtector\\nvcuda.dll C:\\Program Files (x86)\\Common Files\\Acronis\\FileProtector64\\nvcuda.dll C:\\Program Files\\PuTTY\\nvcuda.dll C:\\Program Files\\Git\\cmd\\nvcuda.dll C:\\Program Files\\Go\\bin\\nvcuda.dll C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Local\\NVIDIA\\ChatWithRTX\\env_nvd_rag\\Lib\\site-packages\\torch\\lib\\nvcuda.dll C:\\Program Files\\Common Files\\Autodesk Shared\\nvcuda.dll C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\nvcuda.dll C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR\\nvcuda.dll C:\\Program Files\\Docker\\Docker\\resources\\bin\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Roaming\\nvm\\nvcuda.dll C:\\Program Files\\nodejs\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python311\\Scripts\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python311\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python312\\Scripts\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python312\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python310\\Scripts\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python310\\nvcuda.dll C:\\Program Files\\platform-tools\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Local\\Microsoft\\WindowsApps\\nvcuda.dll C:\\Users\\Joshua\\.dotnet\\tools\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Roaming\\npm\\nvcuda.dll C:\\Users\\Joshua\\go\\bin\\nvcuda.dll C:\\Program Files\\heroku\\bin\\nvcuda.dll C:\\Users\\Joshua\\.fly\\bin\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Local\\GitHubDesktop\\bin\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Local\\ffmpegio\\ffmpeg-downloader\\ffmpeg\\bin\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Local\\Programs\\Ollama\\nvcuda.dll C:\\Users\\Joshua\\AppData\\Roaming\\nvm\\nvcuda.dll C:\\Program Files\\nodejs\\nvcuda.dll c:\\windows\\system*\\nvcuda.dll]"
time=2024-10-20T06:30:28.576+08:00 level=DEBUG source=gpu.go:496 msg="skipping PhysX cuda library path" path="C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common\\nvcuda.dll"
time=2024-10-20T06:30:28.577+08:00 level=DEBUG source=gpu.go:525 msg="discovered GPU libraries" paths=[C:\Windows\system32\nvcuda.dll]
time=2024-10-20T06:30:28.617+08:00 level=DEBUG source=gpu.go:118 msg="detected GPUs" count=2 library=C:\Windows\system32\nvcuda.dll
time=2024-10-20T06:30:28.870+08:00 level=INFO source=gpu.go:292 msg="detected OS VRAM overhead" id=GPU-2 library=cuda compute=8.6 driver=12.7 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB"
time=2024-10-20T06:30:28.891+08:00 level=DEBUG source=amd_hip_windows.go:88 msg=hipDriverGetVersion version=60241512
time=2024-10-20T06:30:28.892+08:00 level=INFO source=amd_hip_windows.go:103 msg="AMD ROCm reports no devices found"
time=2024-10-20T06:30:28.894+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-1 library=cuda variant=v12 compute=8.6 driver=12.7 name="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.8 GiB"
time=2024-10-20T06:30:28.894+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-2 library=cuda variant=v12 compute=8.6 driver=12.7 name="NVIDIA GeForce RTX 3090" total="24.0 GiB" available="22.8 GiB"
[GIN] 2024/10/20 - 06:30:41 | 200 | 16.1023ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2024/10/20 - 06:30:41 | 200 | 0s | 127.0.0.1 | GET "/api/version"
[GIN] 2024/10/20 - 06:30:47 | 200 | 0s | 127.0.0.1 | GET "/api/version"
time=2024-10-20T06:31:00.038+08:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.8 GiB" before.free="17.0 GiB" before.free_swap="33.7 GiB" now.total="31.8 GiB" now.free="16.9 GiB" now.free_swap="33.6 GiB"
time=2024-10-20T06:31:00.048+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.5 GiB" now.used="1.5 GiB"
time=2024-10-20T06:31:00.054+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-2 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="656.4 MiB"
time=2024-10-20T06:31:00.055+08:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0xa1cca0 gpu_count=2
time=2024-10-20T06:31:00.074+08:00 level=DEBUG source=sched.go:224 msg="loading first model" model=C:\Users\Joshua\.ollama\models\blobs\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4
time=2024-10-20T06:31:00.074+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2024-10-20T06:31:00.074+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-10-20T06:31:00.075+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.8 GiB]"
time=2024-10-20T06:31:00.076+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=1 available="[22.5 GiB]"
time=2024-10-20T06:31:00.076+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.8 GiB 22.5 GiB]"
time=2024-10-20T06:31:00.076+08:00 level=INFO source=sched.go:730 msg="new model will fit in available VRAM, loading" model=C:\Users\Joshua\.ollama\models\blobs\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4 library=cuda parallel=4 required="40.7 GiB"
time=2024-10-20T06:31:00.076+08:00 level=INFO source=server.go:108 msg="system memory" total="31.8 GiB" free="16.9 GiB" free_swap="33.6 GiB"
time=2024-10-20T06:31:00.076+08:00 level=DEBUG source=memory.go:103 msg=evaluating library=cuda gpu_count=2 available="[22.8 GiB 22.5 GiB]"
time=2024-10-20T06:31:00.077+08:00 level=INFO source=memory.go:326 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=81 layers.split=41,40 memory.available="[22.8 GiB 22.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="40.7 GiB" memory.required.partial="40.7 GiB" memory.required.kv="2.5 GiB" memory.required.allocations="[20.7 GiB 19.9 GiB]" memory.weights.total="35.9 GiB" memory.weights.repeating="35.1 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_v6.1\ollama_llama_server.exe
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu\ollama_llama_server.exe
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx\ollama_llama_server.exe
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cpu_avx2\ollama_llama_server.exe
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v11\ollama_llama_server.exe
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\cuda_v12\ollama_llama_server.exe
time=2024-10-20T06:31:00.077+08:00 level=DEBUG source=common.go:294 msg="availableServers : found" file=C:\Users\Joshua\AppData\Local\Programs\Ollama\lib\ollama\runners\rocm_v6.1\ollama_llama_server.exe
time=2024-10-20T06:31:00.082+08:00 level=INFO source=server.go:399 msg="starting llama server" cmd="C:\\Users\\Joshua\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe --model C:\\Users\\Joshua\\.ollama\\models\\blobs\\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --verbose --no-mmap --parallel 4 --tensor-split 41,40 --port 49650"
time=2024-10-20T06:31:00.082+08:00 level=DEBUG source=server.go:416 msg=subprocess environment="[PATH=C:\\Users\\Joshua\\AppData\\Local\\Programs\\Ollama\\lib\\ollama;C:\\Users\\Joshua\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cuda_v12;C:\\Program Files\\Microsoft MPI\\Bin\\;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\Microsoft SQL Server\\130\\Tools\\Binn\\;C:\\Program Files\\dotnet\\;C:\\Program Files (x86)\\Common Files\\Acronis\\SnapAPI\\;C:\\Program Files (x86)\\Common Files\\Acronis\\VirtualFile\\;C:\\Program Files (x86)\\Common Files\\Acronis\\VirtualFile64\\;C:\\Program Files (x86)\\Common Files\\Acronis\\FileProtector\\;C:\\Program Files (x86)\\Common Files\\Acronis\\FileProtector64\\;C:\\Program Files\\PuTTY\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\Go\\bin;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Users\\Joshua\\AppData\\Local\\NVIDIA\\ChatWithRTX\\env_nvd_rag\\Lib\\site-packages\\torch\\lib;C:\\Program Files\\Common Files\\Autodesk Shared\\;C:\\Program Files\\Microsoft SQL Server\\150\\Tools\\Binn\\;C:\\Program Files\\NVIDIA Corporation\\NVIDIA app\\NvDLISR;C:\\Program Files\\Docker\\Docker\\resources\\bin;C:\\Users\\Joshua\\AppData\\Roaming\\nvm;C:\\Program Files\\nodejs;C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python311\\Scripts\\;C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python311\\;C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python312\\Scripts\\;C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python312\\;C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python310\\Scripts\\;C:\\Users\\Joshua\\AppData\\Local\\Programs\\Python\\Python310\\;C:\\Program Files\\platform-tools;C:\\Users\\Joshua\\AppData\\Local\\Microsoft\\WindowsApps;C:\\Users\\Joshua\\.dotnet\\tools;C:\\Users\\Joshua\\AppData\\Roaming\\npm;C:\\Users\\Joshua\\go\\bin;C:\\Program Files\\heroku\\bin;C:\\Users\\Joshua\\.fly\\bin;C:\\Users\\Joshua\\AppData\\Local\\GitHubDesktop\\bin;C:\\Users\\Joshua\\AppData\\Local\\ffmpegio\\ffmpeg-downloader\\ffmpeg\\bin;C:\\Users\\Joshua\\AppData\\Local\\Programs\\Ollama;C:\\Users\\Joshua\\AppData\\Roaming\\nvm;C:\\Program Files\\nodejs CUDA_VISIBLE_DEVICES=GPU-2,GPU-1]"
time=2024-10-20T06:31:00.114+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-10-20T06:31:00.114+08:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2024-10-20T06:31:00.114+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
INFO [wmain] starting c++ runner | tid="45116" timestamp=1729377060
INFO [wmain] build info | build=3670 commit="aad7f071" tid="45116" timestamp=1729377060
INFO [wmain] system info | n_threads=20 n_threads_batch=20 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="45116" timestamp=1729377060 total_threads=28
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="27" port="49650" tid="45116" timestamp=1729377060
llama_model_loader: loaded meta data with 41 key-value pairs and 724 tensors from C:\Users\Joshua\.ollama\models\blobs\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama 3.1 70B Instruct
llama_model_loader: - kv 3: general.organization str = Meta Llama
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Llama-3.1
llama_model_loader: - kv 6: general.size_label str = 70B
llama_model_loader: - kv 7: general.license str = llama3.1
llama_model_loader: - kv 8: general.base_model.count u32 = 1
llama_model_loader: - kv 9: general.base_model.0.name str = Llama 3.1 70B Instruct
llama_model_loader: - kv 10: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 12: general.tags arr[str,3] = ["nvidia", "llama3.1", "text-generati...
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: general.datasets arr[str,1] = ["nvidia/HelpSteer2"]
llama_model_loader: - kv 15: llama.block_count u32 = 80
llama_model_loader: - kv 16: llama.context_length u32 = 131072
llama_model_loader: - kv 17: llama.embedding_length u32 = 8192
llama_model_loader: - kv 18: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 19: llama.attention.head_count u32 = 64
llama_model_loader: - kv 20: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 21: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 22: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 23: llama.attention.key_length u32 = 128
llama_model_loader: - kv 24: llama.attention.value_length u32 = 128
llama_model_loader: - kv 25: general.file_type u32 = 13
llama_model_loader: - kv 26: llama.vocab_size u32 = 128256
llama_model_loader: - kv 27: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 29: tokenizer.ggml.pre str = llama-bpe
time=2024-10-20T06:31:00.366+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 35: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 36: general.quantization_version u32 = 2
llama_model_loader: - kv 37: quantize.imatrix.file str = /models_out/Llama-3.1-Nemotron-70B-In...
llama_model_loader: - kv 38: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 39: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 40: quantize.imatrix.chunks_count i32 = 125
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q3_K: 321 tensors
llama_model_loader: - type q5_K: 240 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q3_K - Large
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 34.58 GiB (4.21 BPW)
llm_load_print_meta: general.name = Llama 3.1 70B Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 1.02 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 17474.99 MiB on device 1: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: exception loading model
time=2024-10-20T06:31:01.822+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2024-10-20T06:31:03.884+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
time=2024-10-20T06:31:05.188+08:00 level=DEBUG source=server.go:439 msg="llama runner terminated" error="exit status 0xc0000409"
time=2024-10-20T06:31:05.389+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error loading model: unable to allocate backend buffer"
time=2024-10-20T06:31:05.389+08:00 level=DEBUG source=sched.go:458 msg="triggering expiration for failed load" model=C:\Users\Joshua\.ollama\models\blobs\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4
time=2024-10-20T06:31:05.389+08:00 level=DEBUG source=sched.go:360 msg="runner expired event received" modelPath=C:\Users\Joshua\.ollama\models\blobs\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4
time=2024-10-20T06:31:05.389+08:00 level=DEBUG source=sched.go:375 msg="got lock to unload" modelPath=C:\Users\Joshua\.ollama\models\blobs\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4
time=2024-10-20T06:31:05.389+08:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.8 GiB" before.free="16.9 GiB" before.free_swap="33.6 GiB" now.total="31.8 GiB" now.free="17.1 GiB" now.free_swap="34.6 GiB"
[GIN] 2024/10/20 - 06:31:05 | 500 | 5.373197s | 127.0.0.1 | POST "/api/chat"
time=2024-10-20T06:31:05.399+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.5 GiB" now.total="24.0 GiB" now.free="22.6 GiB" now.used="1.4 GiB"
time=2024-10-20T06:31:05.404+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-2 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="656.4 MiB"
time=2024-10-20T06:31:05.405+08:00 level=DEBUG source=server.go:1097 msg="stopping llama server"
time=2024-10-20T06:31:05.405+08:00 level=DEBUG source=sched.go:380 msg="runner released" modelPath=C:\Users\Joshua\.ollama\models\blobs\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4
time=2024-10-20T06:31:05.655+08:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.8 GiB" before.free="17.1 GiB" before.free_swap="34.6 GiB" now.total="31.8 GiB" now.free="17.1 GiB" now.free_swap="34.6 GiB"
time=2024-10-20T06:31:05.666+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.6 GiB" now.total="24.0 GiB" now.free="22.6 GiB" now.used="1.4 GiB"
time=2024-10-20T06:31:05.671+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-2 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="656.4 MiB"
time=2024-10-20T06:31:05.906+08:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.8 GiB" before.free="17.1 GiB" before.free_swap="34.6 GiB" now.total="31.8 GiB" now.free="17.1 GiB" now.free_swap="34.6 GiB"
time=2024-10-20T06:31:05.915+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.6 GiB" now.total="24.0 GiB" now.free="22.6 GiB" now.used="1.4 GiB"
time=2024-10-20T06:31:05.921+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-2 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="656.4 MiB"
time=2024-10-20T06:31:06.155+08:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.8 GiB" before.free="17.1 GiB" before.free_swap="34.6 GiB" now.total="31.8 GiB" now.free="17.1 GiB" now.free_swap="34.6 GiB"
time=2024-10-20T06:31:06.169+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.6 GiB" now.total="24.0 GiB" now.free="22.6 GiB" now.used="1.4 GiB"
time=2024-10-20T06:31:06.174+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-2 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="656.4 MiB"
time=2024-10-20T06:31:06.406+08:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.8 GiB" before.free="17.1 GiB" before.free_swap="34.6 GiB" now.total="31.8 GiB" now.free="17.1 GiB" now.free_swap="34.6 GiB"
time=2024-10-20T06:31:06.415+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.6 GiB" now.total="24.0 GiB" now.free="22.6 GiB" now.used="1.4 GiB"
time=2024-10-20T06:31:06.420+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-2 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="656.4 MiB"
time=2024-10-20T06:31:06.656+08:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.8 GiB" before.free="17.1 GiB" before.free_swap="34.6 GiB" now.total="31.8 GiB" now.free="17.1 GiB" now.free_swap="34.6 GiB"
time=2024-10-20T06:31:06.669+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.6 GiB" now.total="24.0 GiB" now.free="22.6 GiB" now.used="1.4 GiB"
time=2024-10-20T06:31:06.674+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-2 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="656.4 MiB"
time=2024-10-20T06:31:06.906+08:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.8 GiB" before.free="17.1 GiB" before.free_swap="34.6 GiB" now.total="31.8 GiB" now.free="17.1 GiB" now.free_swap="34.6 GiB"
time=2024-10-20T06:31:06.916+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.6 GiB" now.total="24.0 GiB" now.free="22.6 GiB" now.used="1.4 GiB"
time=2024-10-20T06:31:06.921+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-2 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="656.4 MiB"
time=2024-10-20T06:31:07.155+08:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.8 GiB" before.free="17.1 GiB" before.free_swap="34.6 GiB" now.total="31.8 GiB" now.free="17.1 GiB" now.free_swap="34.6 GiB"
time=2024-10-20T06:31:07.167+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.6 GiB" now.total="24.0 GiB" now.free="22.6 GiB" now.used="1.4 GiB"
time=2024-10-20T06:31:07.173+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-2 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="656.4 MiB"
time=2024-10-20T06:31:07.406+08:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.8 GiB" before.free="17.1 GiB" before.free_swap="34.6 GiB" now.total="31.8 GiB" now.free="17.1 GiB" now.free_swap="34.6 GiB"
time=2024-10-20T06:31:07.417+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.6 GiB" now.total="24.0 GiB" now.free="22.6 GiB" now.used="1.4 GiB"
time=2024-10-20T06:31:07.423+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-2 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="656.4 MiB"
time=2024-10-20T06:31:07.656+08:00 level=DEBUG source=gpu.go:359 msg="updating system memory data" before.total="31.8 GiB" before.free="17.1 GiB" before.free_swap="34.6 GiB" now.total="31.8 GiB" now.free="17.0 GiB" now.free_swap="34.4 GiB"
time=2024-10-20T06:31:07.669+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-1 name="NVIDIA GeForce RTX 3090" overhead="0 B" before.total="24.0 GiB" before.free="22.6 GiB" now.total="24.0 GiB" now.free="22.6 GiB" now.used="1.4 GiB"
time=2024-10-20T06:31:07.675+08:00 level=DEBUG source=gpu.go:407 msg="updating cuda memory data" gpu=GPU-2 name="NVIDIA GeForce RTX 3090" overhead="613.6 MiB" before.total="24.0 GiB" before.free="22.8 GiB" now.total="24.0 GiB" now.free="22.8 GiB" now.used="656.4 MiB"
time=2024-10-20T06:31:07.677+08:00 level=DEBUG source=sched.go:659 msg="gpu VRAM free memory converged after 2.29 seconds" model=C:\Users\Joshua\.ollama\models\blobs\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4
time=2024-10-20T06:31:07.677+08:00 level=DEBUG source=sched.go:384 msg="sending an unloaded event" modelPath=C:\Users\Joshua\.ollama\models\blobs\sha256-001c9aacecbdca348f7c7c6d2b1a4120d447bf023afcacb3b864df023f1e2be4
time=2024-10-20T06:31:07.677+08:00 level=DEBUG source=sched.go:308 msg="ignoring unload event with no pending requests"
Unfortunately it's not clear why the alloc failed; it seems like there should be plenty of VRAM available. The 0xc0000409 (STATUS_STACK_BUFFER_OVERRUN) exit status suggests that recovery from the failed alloc also failed and the runner crashed, so it could indicate some deeper problem which would need further investigation.
There are some workarounds that you can try to get the model to load.
- Reduce the number of layers offloaded to the GPU by explicitly setting num_gpu. See here for details. You can find the current value by searching for layers.model in the logs. For this model it's 81, so try 75 and, if that works, increase it until you get a failure (see the example request after this list).
- Set OLLAMA_FLASH_ATTENTION=1 in the server environment. Flash attention is a more efficient use of KV space and reduces memory pressure.
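If you want to experiment with a lower layer count before baking it into a model, the same parameter can also be passed per request through the API options map. A sketch, assuming the default 127.0.0.1:11434 endpoint, the llama3.1:70b-instruct-q3_K_L tag, and a shell such as Git Bash where curl and single quotes work as written:

$ curl http://127.0.0.1:11434/api/generate -d '{
    "model": "llama3.1:70b-instruct-q3_K_L",
    "prompt": "hello there",
    "options": { "num_gpu": 75 }
  }'

The options map accepts the same runtime parameters as a Modelfile, so num_gpu here overrides the default behaviour of offloading all layers for this one request.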
I'm getting the "Error: llama runner process has terminated: error loading model: unable to allocate backend buffer" error while loading the model, so I can't get to the CLI to run the /set command.
Create a new model:
$ ollama show --modelfile llama3.1:70b-instruct-q3_K_L | sed -e 's/^FROM.*/FROM llama3.1:70b-instruct-q3_K_L/' > Modelfile
$ echo "PARAMETER num_gpu 75" >> Modelfile
$ ollama create llama3.1:70b-instruct-q3_K_L-ng75 -f Modelfile
$ ollama run llama3.1:70b-instruct-q3_K_L-ng75 "hello there"
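For reference, the Modelfile produced by those first two commands is just the output of ollama show --modelfile with the FROM line pointed back at the existing tag and the new parameter appended at the end, roughly like this (the original TEMPLATE and PARAMETER lines are elided here):

FROM llama3.1:70b-instruct-q3_K_L
# ... TEMPLATE and PARAMETER lines copied from the original model ...
PARAMETER num_gpu 75

If 75 layers load cleanly, edit that last value upward and re-run ollama create to find the highest setting that still fits in VRAM.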