
v0.1.33 can't load gemma:7b-instruct-v1.1-fp16 due to failed to create context with model

Open MarkWard0110 opened this issue 9 months ago • 3 comments

What is the issue?

Ollama v0.1.33, Intel Core i9-14900K, 96 GB RAM, NVIDIA RTX 4070 Ti Super 16 GB

Attempts to load gemma:7b-instruct-v1.1-fp16 are failing.

I have tried:

  • restarting Ollama
  • deleting and re-downloading the model

I do not have issues loading other models of various sizes; only this one fails. No other processes are using the GPU.

May 04 13:28:00 quorra ollama[537684]: [GIN] 2024/05/04 - 13:28:00 | 200 |      43.339µs |       127.0.0.1 | HEAD     "/"
May 04 13:28:00 quorra ollama[537684]: [GIN] 2024/05/04 - 13:28:00 | 200 |    1.371607ms |       127.0.0.1 | POST     "/api/show"
May 04 13:28:00 quorra ollama[537684]: [GIN] 2024/05/04 - 13:28:00 | 200 |     651.825µs |       127.0.0.1 | POST     "/api/show"
May 04 13:28:00 quorra ollama[537684]: time=2024-05-04T13:28:00.520Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
May 04 13:28:00 quorra ollama[537684]: time=2024-05-04T13:28:00.520Z level=DEBUG source=gpu.go:203 msg="Searching for GPU library" name=libcudart.so*
May 04 13:28:00 quorra ollama[537684]: time=2024-05-04T13:28:00.520Z level=DEBUG source=gpu.go:221 msg="gpu library search" globs="[/tmp/ollama4249562492/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so* /libcudart.so**]"
May 04 13:28:00 quorra ollama[537684]: time=2024-05-04T13:28:00.524Z level=DEBUG source=gpu.go:249 msg="discovered GPU libraries" paths=[/tmp/ollama4249562492/runners/cuda_v11/libcudart.so.11.0]
May 04 13:28:00 quorra ollama[537684]: CUDA driver version: 12-4
May 04 13:28:00 quorra ollama[537684]: time=2024-05-04T13:28:00.526Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama4249562492/runners/cuda_v11/libcudart.so.11.0 count=1
May 04 13:28:00 quorra ollama[537684]: time=2024-05-04T13:28:00.526Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May 04 13:28:00 quorra ollama[537684]: [GPU-007c9d9a-8177-bd6f-7654-45652102b937] CUDA totalMem 16852516864
May 04 13:28:00 quorra ollama[537684]: [GPU-007c9d9a-8177-bd6f-7654-45652102b937] CUDA freeMem 16627466240
May 04 13:28:00 quorra ollama[537684]: [GPU-007c9d9a-8177-bd6f-7654-45652102b937] Compute Capability 8.9
May 04 13:28:00 quorra ollama[537684]: time=2024-05-04T13:28:00.591Z level=DEBUG source=amd_linux.go:297 msg="amdgpu driver not detected /sys/module/amdgpu"
May 04 13:28:00 quorra ollama[537684]: releasing cudart library
May 04 13:28:00 quorra ollama[537684]: time=2024-05-04T13:28:00.603Z level=DEBUG source=gguf.go:57 msg="model = &llm.gguf{containerGGUF:(*llm.containerGGUF)(0xc0007a0340), kv:llm.KV{}, tensors:[]*llm.Tensor(nil), parameters:0x0}"
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=sched.go:162 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-8374dc9da250dffb1ef78505964e8c072fe6688882f93dd72cb870c8a6f0981b
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=memory.go:64 msg=evaluating library=cuda gpu_count=1 available="15857.2 MiB"
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=25 memory.available="15857.2 MiB" memory.required.full="17919.7 MiB" memory.required.partial="15384.8 MiB" memory.required.kv="672.0 MiB" memory.weights.total="16284.7 MiB" memory.weights.repeating="14784.7 MiB" memory.weights.nonrepeating="1500.0 MiB" memory.graph.full="506.0 MiB" memory.graph.partial="1127.2 MiB"
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=memory.go:64 msg=evaluating library=cuda gpu_count=1 available="15857.2 MiB"
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=25 memory.available="15857.2 MiB" memory.required.full="17919.7 MiB" memory.required.partial="15384.8 MiB" memory.required.kv="672.0 MiB" memory.weights.total="16284.7 MiB" memory.weights.repeating="14784.7 MiB" memory.weights.nonrepeating="1500.0 MiB" memory.graph.full="506.0 MiB" memory.graph.partial="1127.2 MiB"
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=memory.go:64 msg=evaluating library=cuda gpu_count=1 available="15857.2 MiB"
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=INFO source=memory.go:152 msg="offload to gpu" layers.real=-1 layers.estimate=25 memory.available="15857.2 MiB" memory.required.full="17919.7 MiB" memory.required.partial="15384.8 MiB" memory.required.kv="672.0 MiB" memory.weights.total="16284.7 MiB" memory.weights.repeating="14784.7 MiB" memory.weights.nonrepeating="1500.0 MiB" memory.graph.full="506.0 MiB" memory.graph.partial="1127.2 MiB"
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4249562492/runners/cpu
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4249562492/runners/cpu_avx
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4249562492/runners/cpu_avx2
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4249562492/runners/cuda_v11
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4249562492/runners/rocm_v60002
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4249562492/runners/cpu
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4249562492/runners/cpu_avx
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4249562492/runners/cpu_avx2
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4249562492/runners/cuda_v11
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama4249562492/runners/rocm_v60002
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=INFO source=server.go:289 msg="starting llama server" cmd="/tmp/ollama4249562492/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8374dc9da250dffb1ef78505964e8c072fe6688882f93dd72cb870c8a6f0981b --ctx-size 2048 --batch-size 512 --embedding --log-format json --n-gpu-layers 25 --verbose --parallel 1 --port 41879"
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.352Z level=DEBUG source=server.go:291 msg=subprocess environment="[LANG=en_US.UTF-8 PATH=/home/mark/.vscode-server/cli/servers/Stable-b58957e67ee1e712cebf466b995adf4c5307b2bd/server/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin HOME=/usr/share/ollama LOGNAME=ollama USER=ollama INVOCATION_ID=34ea3edcb48e4ac1893625adf0208cf3 JOURNAL_STREAM=8:224924 SYSTEMD_EXEC_PID=537684 OLLAMA_HOST=0.0.0.0 OLLAMA_DEBUG=1 LD_LIBRARY_PATH=/tmp/ollama4249562492/runners/cuda_v11 CUDA_VISIBLE_DEVICES=GPU-007c9d9a-8177-bd6f-7654-45652102b937]"
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.353Z level=INFO source=sched.go:340 msg="loaded runners" count=1
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.353Z level=INFO source=server.go:432 msg="waiting for llama runner to start responding"
May 04 13:28:01 quorra ollama[538702]: {"function":"server_params_parse","level":"WARN","line":2497,"msg":"server.cpp is not built with verbose logging.","tid":"140639904813056","timestamp":1714829281}
May 04 13:28:01 quorra ollama[538702]: {"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2822,"msg":"build info","tid":"140639904813056","timestamp":1714829281}
May 04 13:28:01 quorra ollama[538702]: {"function":"main","level":"INFO","line":2825,"msg":"system info","n_threads":8,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"140639904813056","timestamp":1714829281,"total_threads":32}
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: loaded meta data with 23 key-value pairs and 254 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-8374dc9da250dffb1ef78505964e8c072fe6688882f93dd72cb870c8a6f0981b (version GGUF V3 (latest))
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv   0:                       general.architecture str              = gemma
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv   1:                               general.name str              = gemma-1.1-7b-it
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv   3:                     gemma.embedding_length u32              = 3072
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv   4:                          gemma.block_count u32              = 28
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv   8:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv   9:                 gemma.attention.key_length u32              = 256
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  10:               gemma.attention.value_length u32              = 256
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  11:                          general.file_type u32              = 1
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 2
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 1
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 3
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - type  f32:   57 tensors
May 04 13:28:01 quorra ollama[537684]: llama_model_loader: - type  f16:  197 tensors
May 04 13:28:01 quorra ollama[537684]: llm_load_vocab: mismatch in special tokens definition ( 416/256000 vs 260/256000 ).
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: format           = GGUF V3 (latest)
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: arch             = gemma
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: vocab type       = SPM
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_vocab          = 256000
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_merges         = 0
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_ctx_train      = 8192
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_embd           = 3072
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_head           = 16
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_head_kv        = 16
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_layer          = 28
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_rot            = 192
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_embd_head_k    = 256
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_embd_head_v    = 256
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_gqa            = 1
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_embd_k_gqa     = 4096
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_embd_v_gqa     = 4096
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: f_norm_eps       = 0.0e+00
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: f_logit_scale    = 0.0e+00
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_ff             = 24576
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_expert         = 0
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_expert_used    = 0
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: causal attn      = 1
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: pooling type     = 0
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: rope type        = 2
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: rope scaling     = linear
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: freq_base_train  = 10000.0
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: freq_scale_train = 1
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: n_yarn_orig_ctx  = 8192
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: rope_finetuned   = unknown
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: ssm_d_conv       = 0
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: ssm_d_inner      = 0
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: ssm_d_state      = 0
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: ssm_dt_rank      = 0
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: model type       = 7B
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: model ftype      = F16
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: model params     = 8.54 B
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: model size       = 15.90 GiB (16.00 BPW)
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: general.name     = gemma-1.1-7b-it
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: BOS token        = 2 '<bos>'
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: EOS token        = 1 '<eos>'
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: UNK token        = 3 '<unk>'
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: PAD token        = 0 '<pad>'
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: LF token         = 227 '<0x0A>'
May 04 13:28:01 quorra ollama[537684]: llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
May 04 13:28:01 quorra ollama[537684]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
May 04 13:28:01 quorra ollama[537684]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
May 04 13:28:01 quorra ollama[537684]: ggml_cuda_init: found 1 CUDA devices:
May 04 13:28:01 quorra ollama[537684]:   Device 0: NVIDIA GeForce RTX 4070 Ti SUPER, compute capability 8.9, VMM: yes
May 04 13:28:01 quorra ollama[537684]: time=2024-05-04T13:28:01.603Z level=DEBUG source=server.go:466 msg="server not yet available" error="server not responding"
May 04 13:28:01 quorra ollama[537684]: llm_load_tensors: ggml ctx size =    0.26 MiB
May 04 13:28:01 quorra ollama[537684]: llm_load_tensors: offloading 25 repeating layers to GPU
May 04 13:28:01 quorra ollama[537684]: llm_load_tensors: offloaded 25/29 layers to GPU
May 04 13:28:01 quorra ollama[537684]: llm_load_tensors:        CPU buffer size = 16284.67 MiB
May 04 13:28:01 quorra ollama[537684]: llm_load_tensors:      CUDA0 buffer size = 13200.59 MiB
May 04 13:28:02 quorra ollama[537684]: ......................................................................................
May 04 13:28:02 quorra ollama[537684]: llama_new_context_with_model: n_ctx      = 2048
May 04 13:28:02 quorra ollama[537684]: llama_new_context_with_model: n_batch    = 512
May 04 13:28:02 quorra ollama[537684]: llama_new_context_with_model: n_ubatch   = 512
May 04 13:28:02 quorra ollama[537684]: llama_new_context_with_model: freq_base  = 10000.0
May 04 13:28:02 quorra ollama[537684]: llama_new_context_with_model: freq_scale = 1
May 04 13:28:02 quorra ollama[537684]: llama_kv_cache_init:  CUDA_Host KV buffer size =    96.00 MiB
May 04 13:28:02 quorra ollama[537684]: llama_kv_cache_init:      CUDA0 KV buffer size =   800.00 MiB
May 04 13:28:02 quorra ollama[537684]: llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
May 04 13:28:02 quorra ollama[537684]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.99 MiB
May 04 13:28:02 quorra ollama[537684]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2006.00 MiB on device 0: cudaMalloc failed: out of memory
May 04 13:28:02 quorra ollama[537684]: ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 2103443456
May 04 13:28:02 quorra ollama[537684]: llama_new_context_with_model: failed to allocate compute buffers
May 04 13:28:02 quorra ollama[537684]: llama_init_from_gpt_params: error: failed to create context with model '/usr/share/ollama/.ollama/models/blobs/sha256-8374dc9da250dffb1ef78505964e8c072fe6688882f93dd72cb870c8a6f0981b'
May 04 13:28:02 quorra ollama[538702]: {"function":"load_model","level":"ERR","line":410,"model":"/usr/share/ollama/.ollama/models/blobs/sha256-8374dc9da250dffb1ef78505964e8c072fe6688882f93dd72cb870c8a6f0981b","msg":"unable to load model","tid":"140639904813056","timestamp":1714829282}
May 04 13:28:02 quorra ollama[537684]: time=2024-05-04T13:28:02.808Z level=DEBUG source=server.go:466 msg="server not yet available" error="health resp: Get \"http://127.0.0.1:41879/health\": dial tcp 127.0.0.1:41879: connect: connection refused"
May 04 13:28:08 quorra ollama[537684]: time=2024-05-04T13:28:08.455Z level=INFO source=server.go:437 msg="context expired before server started"
May 04 13:28:08 quorra ollama[537684]: time=2024-05-04T13:28:08.455Z level=ERROR source=sched.go:346 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
May 04 13:28:08 quorra ollama[537684]: time=2024-05-04T13:28:08.455Z level=DEBUG source=sched.go:349 msg="triggering expiration for failed load" model=/usr/share/ollama/.ollama/models/blobs/sha256-8374dc9da250dffb1ef78505964e8c072fe6688882f93dd72cb870c8a6f0981b
May 04 13:28:08 quorra ollama[537684]: time=2024-05-04T13:28:08.455Z level=DEBUG source=sched.go:265 msg="runner expired event received" model=/usr/share/ollama/.ollama/models/blobs/sha256-8374dc9da250dffb1ef78505964e8c072fe6688882f93dd72cb870c8a6f0981b
May 04 13:28:08 quorra ollama[537684]: time=2024-05-04T13:28:08.455Z level=DEBUG source=sched.go:280 msg="got lock to unload" model=/usr/share/ollama/.ollama/models/blobs/sha256-8374dc9da250dffb1ef78505964e8c072fe6688882f93dd72cb870c8a6f0981b
May 04 13:28:08 quorra ollama[537684]: time=2024-05-04T13:28:08.455Z level=DEBUG source=server.go:895 msg="stopping llama server"
May 04 13:28:08 quorra ollama[537684]: time=2024-05-04T13:28:08.455Z level=DEBUG source=server.go:902 msg="llama server stopped"
May 04 13:28:08 quorra ollama[537684]: [GIN] 2024/05/04 - 13:28:08 | 499 |  7.935904738s |       127.0.0.1 | POST     "/api/chat"
May 04 13:28:08 quorra ollama[537684]: time=2024-05-04T13:28:08.455Z level=DEBUG source=sched.go:285 msg="runner released" model=/usr/share/ollama/.ollama/models/blobs/sha256-8374dc9da250dffb1ef78505964e8c072fe6688882f93dd72cb870c8a6f0981b
May 04 13:28:08 quorra ollama[537684]: time=2024-05-04T13:28:08.455Z level=DEBUG source=sched.go:287 msg="sending an unloaded event" model=/usr/share/ollama/.ollama/models/blobs/sha256-8374dc9da250dffb1ef78505964e8c072fe6688882f93dd72cb870c8a6f0981b
May 04 13:28:08 quorra ollama[537684]: time=2024-05-04T13:28:08.455Z level=DEBUG source=sched.go:215 msg="ignoring unload event with no pending requests"
Sat May  4 13:28:51 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0 Off |                  N/A |
|  0%   33C    P8             10W /  285W |       4MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I am able to load the following models without issues:

    "command-r:35b-v0.1-q4_0" # latest, 35b, v0.1, 35b-v0.1-q4_0
    "command-r-plus:104b-q4_0" # 104b, latest, 104b-q4_0
    "dbrx:132b-instruct-q4_0" # 132b, latest, instruct, 132b-instruct-q4_0
    "gemma:7b-instruct-v1.1-q4_0" # latest, 7b, instruct, v1.1, 7b-instruct, 7b-v1.1, 7b-instruct-v1.1-q4_0
    "gemma:2b-instruct-v1.1-q4_0" # 2b, 2b-instruct, 2b-v1.1, 2b-instruct-v1.1-q4_0
    "gemma:2b-instruct-v1.1-fp16"
    "llama2:70b-chat-q4_0" # 7b, latest, chat, 7b-chat, 7b-chat-q4_0
    "llama2:13b-chat-q4_0" # 13b, 13b-chat, 13b-chat-q4_0
    "llama3:8b-instruct-q4_0" # 8b, instruct, latest, 8b-instruct-q4_0
    "llama3:8b-instruct-q8_0"
    "llama3:8b-instruct-q6_K"
    "llama3:8b-instruct-fp16"
    "llama3:70b-instruct-q4_0" # 70b, 70b-instruct, 70b-instruct-q4_0
    "llama3-gradient:8b-instruct-1048k-q4_0" # 8b, latest, 1048k, instruct, 8b-instruct-1048k-q4_0
    "llama3-gradient:8b-instruct-1048k-fp16"
    "mistral:7b-instruct-v0.2-q4_0" # 7b, instruct, mistral:7b-instruct, latest, v0.2
    "mistral:7b-instruct-v0.2-fp16"
    "mixtral:8x7b-instruct-v0.1-q4_0" # latest, instruct, 8x7b, 8x7b-instruct-v0.1, 8x7b-instruct-v0.1-q4_0
    "mixtral:8x22b-instruct-v0.1-q4_0" # 8x22b, v0.1, 8x22b-instruct, 8x22b-instruct-v0.1-q4_0
    "neural-chat:7b-v3.3-q4_0" # latest, 7b, 7b-v3.3, 7b-v3.3-q4_0
    "orca-mini:3b-q4_0" # latest, 3b, 3b-q4_0
    "orca-mini:7b-v3-q4_0" # 7b, 7b-v3, 7b-v3-q4_0
    "orca-mini:13b-v3-q4_0" # 13b, 13b-v3, 13b-v3-q4_0
    "orca2:7b-q4_0" # latest, 7b, 7b-q4_0
    "orca2:13b-q4_0" # 13b, 13b-q4_0
    "phi3:3.8b-mini-instruct-4k-q4_K_M" # latest, instruct, 3.8b, 3.8b-mini-instruct-4k-q4_K_M
    "phi3:3.8b-mini-instruct-4k-fp16"
    "qwen:4b-chat-v1.5-q4_0" # latest, 4b-chat, 4b
    "qwen:7b-chat-v1.5-q4_0" # 7b-chat, 7b
    "qwen:7b-chat-v1.5-fp16"
    "qwen:72b-chat-v1.5-q4_0" # 72b-chat, 72b
    "qwen:14b-chat-v1.5-q4_0" # 14b-chat, 14b
    "qwen:32b-chat-v1.5-q4_0" # 32b-chat, 32b
    "wizardlm2:7b-q4_0" # 7b, latest, 7b-q4_0
    "wizardlm2:7b-fp16"

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.33

MarkWard0110 · May 04 '24

Yes, after the 0.1.33 release many things have broken.

Unfortunately, I think contributors are trying to move so fast that they are unable to maintain test coverage or write clean, quality code.

I was very hopeful for Ollama and its community, but if this FOMO release cycle keeps breaking things I might need to turn back to LiteLLM or other alternatives :'/

UmutAlihan · May 04 '24

@UmutAlihan we've actually been building out a test farm to better catch these issues before we release, but there are a lot of different permutations to test. Stability is incredibly important to us.

That said, in 0.1.33 we were trying to improve our memory calculation to pack models in more efficiently, and sometimes we weren't reserving enough space, so some layers were allocated to the GPU when they should have been allocated to the CPU. The problem is that if we're too conservative, performance suffers because more layers get sent to the CPU, and there will be a dozen issues with people complaining about slow performance.
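To make that trade-off concrete, here is a toy sketch of the kind of greedy packing estimate a scheduler can make, using rough per-layer figures taken from the "offload to gpu" log line above. This is not Ollama's actual memory.go logic, just an illustration of how the layer count falls out of the reserved overhead.

```go
package main

import "fmt"

func main() {
	// Rough figures from the "offload to gpu" log line above, all in MiB.
	// Toy estimate only; the real estimator splits KV and graph costs
	// between GPU and CPU and accounts for more buffers than this.
	const (
		available    = 15857.2      // memory.available
		totalLayers  = 28           // gemma.block_count
		perLayer     = 14784.7 / 28 // memory.weights.repeating / layers
		nonRepeating = 1500.0       // memory.weights.nonrepeating
		kvCache      = 672.0        // memory.required.kv
		graphPartial = 1127.2       // memory.graph.partial
	)

	// Reserve the fixed costs first, then pack repeating layers greedily.
	used := nonRepeating + kvCache + graphPartial
	layers := 0
	for layers < totalLayers && used+perLayer <= available {
		used += perLayer
		layers++
	}
	fmt.Printf("would offload %d/%d layers, ~%.0f MiB reserved\n",
		layers, totalLayers, used)
}
```

The real estimator settled on 25 layers here; the point is only that if the fixed overhead (graph and KV buffers) is underestimated, the layer count comes out too high and the compute buffer allocated later no longer fits.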

Unfortunately I don't have a 4070 Ti Super to test on. I think what's happening is that the model is close to the size of your VRAM and we're not calculating the memory graph correctly with gemma. I'll double-check with some other people on the team.
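For reference, plugging the GPU-side numbers already present in the log into a quick check shows how tight the fit is. This is just arithmetic on the reported figures, not the estimator itself:

```go
package main

import "fmt"

func main() {
	// GPU-side figures reported in the log above, in MiB.
	const (
		available  = 15857.2  // memory.available
		weightsGPU = 13200.59 // llm_load_tensors: CUDA0 buffer size
		kvGPU      = 800.0    // llama_kv_cache_init: CUDA0 KV buffer size
		computeBuf = 2006.0   // compute buffer that failed to allocate
	)
	needed := weightsGPU + kvGPU + computeBuf
	fmt.Printf("needed ~%.0f MiB vs %.0f MiB available (short by ~%.0f MiB)\n",
		needed, available, needed-available)
}
```

That comes out roughly 150 MiB over what the scheduler reported as available, which lines up with the cudaMalloc failure on the 2006 MiB compute buffer.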

pdevine · May 19 '24

Well, thank you for the detailed response 🫡

I am using 2x 3060s, and yes, llama3 8b loads onto the 24 GB of VRAM with around 80% utilization. So I assume your root-cause analysis is correct, and I hope more users will prefer stability over performance 🙏

UmutAlihan · May 19 '24

> Yes, after the 0.1.33 release many things have broken.
>
> Unfortunately, I think contributors are trying to move so fast that they are unable to maintain test coverage or write clean, quality code.
>
> I was very hopeful for Ollama and its community, but if this FOMO release cycle keeps breaking things I might need to turn back to LiteLLM or other alternatives :'/

100%

oldgithubman · May 25 '24

I believe this has been resolved. I don't have an exact duplicate test system, but I've tried to load gemma:7b-instruct-v1.1-fp16 on a few systems with similar VRAM setups and the model loads successfully.

One thing I noticed in the opening comment is the error message "error loading llama server" error="timed out waiting for llama runner to start: context canceled", which implies the client gave up waiting, so we canceled the load. This could be a result of mmap behavior leading to I/O thrashing on the system. Since 0.1.33 we've refined our algorithm for when to use mmap vs. normal file reads, so I believe the default should switch to file reads, which may speed up load times. You can also set use_mmap=false in the request or in a custom model to force mmap to be disabled.
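For anyone who wants to try that, here is a rough sketch of a request with mmap disabled. It assumes the default local endpoint (127.0.0.1:11434) and that the request options field accepts use_mmap, as suggested above:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Sketch: one generate request with mmap disabled via options.
	payload, err := json.Marshal(map[string]any{
		"model":   "gemma:7b-instruct-v1.1-fp16",
		"prompt":  "Hello",
		"stream":  false,
		"options": map[string]any{"use_mmap": false},
	})
	if err != nil {
		panic(err)
	}
	resp, err := http.Post("http://127.0.0.1:11434/api/generate",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

In a custom Modelfile the equivalent would presumably be a PARAMETER use_mmap false line; check the current Modelfile documentation for the accepted parameter names.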

If you're still having trouble loading this model on the latest version, please share an updated server log and I'll reopen the issue.

dhiltgen · Jul 25 '24