Alpaca only runs on the CPU, despite correct drivers being installed for my RTX 4080 GPU
I have a Medion Beast X40, which has an integrated Intel GPU and a discrete NVIDIA RTX 4080 GPU.
When running Alpaca, whether I launch it on the discrete GPU or on the integrated GPU (using Fedora GNOME's PrefersNonDefaultGPU variable), it defaults to using the CPU and RAM to run the models rather than either of my GPUs - ideally it would use the NVIDIA one.
I have the correct drivers installed, and other apps, such as Steam games or Krita AI Diffusion, are able to correctly recognise and use the NVIDIA GPU.
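Side note: I am not sure whether this is relevant here, but a common cause of CUDA failing inside a Flatpak sandbox even when native apps see the GPU is a mismatch between the host driver and the Flatpak NVIDIA runtime extension. A quick comparison, assuming the extension is what provides the NVIDIA userspace inside the sandbox:
# Host driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# NVIDIA userspace available to Flatpak apps; the version in the extension name should match the host driver
flatpak list --runtime | grep -i nvidia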
Expected behavior: a way to customise whether Alpaca runs models on the discrete GPU, the integrated GPU, or the CPU. Or, failing that, that it automatically runs on the most powerful GPU, not the CPU.
Debugging information
aarvi@fedora:~$ flatpak run com.jeffser.Alpaca
INFO [main.py | main] Alpaca version: 7.0.1
MESA-INTEL: warning: ../src/intel/vulkan/anv_formats.c:873: FINISHME: support YUV colorspace with DRM format modifiers
MESA-INTEL: warning: ../src/intel/vulkan/anv_formats.c:905: FINISHME: support more multi-planar formats with DRM modifiers
INFO [ollama_instances.py | start] Starting Alpaca's Ollama instance...
INFO [ollama_instances.py | start] Started Alpaca's Ollama instance
time=2025-07-12T13:14:45.623+01:00 level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES:1 HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11435 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:1 http_proxy: https_proxy: no_proxy:]"
time=2025-07-12T13:14:45.624+01:00 level=INFO source=images.go:476 msg="total blobs: 9"
time=2025-07-12T13:14:45.624+01:00 level=INFO source=images.go:483 msg="total unused blobs removed: 0"
time=2025-07-12T13:14:45.624+01:00 level=INFO source=routes.go:1288 msg="Listening on 127.0.0.1:11435 (version 0.9.3)"
time=2025-07-12T13:14:45.624+01:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
INFO [ollama_instances.py | start] client version is 0.9.3
time=2025-07-12T13:14:45.739+01:00 level=WARN source=cuda_common.go:65 msg="old CUDA driver detected - please upgrade to a newer driver" version=0.0
time=2025-07-12T13:14:45.780+01:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-a1a7f184-3441-f0eb-2522-79f6f904c4eb library=cuda variant=v11 compute=8.9 driver=0.0 name="" total="15.6 GiB" available="15.3 GiB"
[GIN] 2025/07/12 - 13:14:45 | 200 | 238.995µs | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/07/12 - 13:14:45 | 200 | 28.431486ms | 127.0.0.1 | POST "/api/show"
[GIN] 2025/07/12 - 13:14:45 | 200 | 33.498358ms | 127.0.0.1 | POST "/api/show"
[GIN] 2025/07/12 - 13:15:00 | 200 | 37.025598ms | 127.0.0.1 | POST "/api/show"
time=2025-07-12T13:15:00.825+01:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 gpu=GPU-a1a7f184-3441-f0eb-2522-79f6f904c4eb parallel=2 available=16421224448 required="6.5 GiB"
time=2025-07-12T13:15:00.927+01:00 level=INFO source=server.go:135 msg="system memory" total="31.1 GiB" free="23.2 GiB" free_swap="7.7 GiB"
time=2025-07-12T13:15:00.927+01:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[15.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.5 GiB" memory.required.partial="6.5 GiB" memory.required.kv="1.0 GiB" memory.required.allocations="[6.5 GiB]" memory.weights.total="4.3 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.58 GiB (4.89 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 8.03 B
print_info: general.name = Meta Llama 3.1 8B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-07-12T13:15:01.068+01:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/app/plugins/Ollama/bin/ollama runner --model /home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 --ctx-size 8192 --batch-size 512 --n-gpu-layers 33 --threads 8 --parallel 2 --port 33039"
time=2025-07-12T13:15:01.068+01:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-07-12T13:15:01.068+01:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-07-12T13:15:01.069+01:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-12T13:15:01.075+01:00 level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /app/plugins/Ollama/lib/ollama/libggml-cpu-alderlake.so
time=2025-07-12T13:15:01.077+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-07-12T13:15:01.077+01:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:33039"
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.58 GiB (4.89 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 8.03 B
print_info: general.name = Meta Llama 3.1 8B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
time=2025-07-12T13:15:01.320+01:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
load_tensors: CPU_Mapped model buffer size = 4685.30 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 2
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 1024
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 1.01 MiB
llama_kv_cache_unified: kv_size = 8192, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
llama_kv_cache_unified: CPU KV buffer size = 1024.00 MiB
llama_kv_cache_unified: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_context: CPU compute buffer size = 560.01 MiB
llama_context: graph nodes = 1094
llama_context: graph splits = 1
time=2025-07-12T13:15:02.073+01:00 level=INFO source=server.go:637 msg="llama runner started in 1.00 seconds"
time=2025-07-12T13:15:02.505+01:00 level=INFO source=sched.go:548 msg="updated VRAM based on existing loaded models" gpu=GPU-a1a7f184-3441-f0eb-2522-79f6f904c4eb library=cuda total="15.6 GiB" available="9.1 GiB"
[GIN] 2025/07/12 - 13:15:03 | 200 | 3.142017075s | 127.0.0.1 | POST "/api/generate"
time=2025-07-12T13:15:08.903+01:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.107856878 runner.size="6.5 GiB" runner.vram="6.5 GiB" runner.parallel=2 runner.pid=124 runner.model=/home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29
time=2025-07-12T13:15:09.144+01:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-19718c32a55ea4e4f0005adbe0d818f3eaa672eeecfdd9c28e393e2c04d18f51 gpu=GPU-a1a7f184-3441-f0eb-2522-79f6f904c4eb parallel=1 available=16421224448 required="12.9 GiB"
time=2025-07-12T13:15:09.153+01:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.357603679 runner.size="6.5 GiB" runner.vram="6.5 GiB" runner.parallel=2 runner.pid=124 runner.model=/home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29
time=2025-07-12T13:15:09.243+01:00 level=INFO source=server.go:135 msg="system memory" total="31.1 GiB" free="23.1 GiB" free_swap="7.7 GiB"
time=2025-07-12T13:15:09.243+01:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[15.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="12.9 GiB" memory.required.partial="12.9 GiB" memory.required.kv="3.0 GiB" memory.required.allocations="[12.9 GiB]" memory.weights.total="8.0 GiB" memory.weights.repeating="7.4 GiB" memory.weights.nonrepeating="607.5 MiB" memory.graph.full="1.3 GiB" memory.graph.partial="1.6 GiB"
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from /home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-19718c32a55ea4e4f0005adbe0d818f3eaa672eeecfdd9c28e393e2c04d18f51 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Cogito v1 Preview Qwen 14B
llama_model_loader: - kv 3: general.basename str = cogito-v1-preview-qwen
llama_model_loader: - kv 4: general.size_label str = 14B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.base_model.count u32 = 1
llama_model_loader: - kv 7: general.base_model.0.name str = Qwen2.5 14B
llama_model_loader: - kv 8: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 9: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-14B
llama_model_loader: - kv 10: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 11: qwen2.block_count u32 = 48
llama_model_loader: - kv 12: qwen2.context_length u32 = 131072
llama_model_loader: - kv 13: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 14: qwen2.feed_forward_length u32 = 13824
llama_model_loader: - kv 15: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 16: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 20: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,151665] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,151665] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if not enable_thinking is defined...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - kv 30: general.file_type u32 = 15
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q4_K: 289 tensors
llama_model_loader: - type q6_K: 49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 8.36 GiB (4.86 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 14.77 B
print_info: general.name = Cogito v1 Preview Qwen 14B
print_info: vocab type = BPE
print_info: n_vocab = 151665
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-07-12T13:15:09.348+01:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/app/plugins/Ollama/bin/ollama runner --model /home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-19718c32a55ea4e4f0005adbe0d818f3eaa672eeecfdd9c28e393e2c04d18f51 --ctx-size 16384 --batch-size 512 --n-gpu-layers 49 --threads 8 --parallel 1 --port 39141"
time=2025-07-12T13:15:09.348+01:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-07-12T13:15:09.348+01:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-07-12T13:15:09.349+01:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-12T13:15:09.358+01:00 level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /app/plugins/Ollama/lib/ollama/libggml-cpu-alderlake.so
time=2025-07-12T13:15:09.361+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-07-12T13:15:09.361+01:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:39141"
llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from /home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-19718c32a55ea4e4f0005adbe0d818f3eaa672eeecfdd9c28e393e2c04d18f51 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Cogito v1 Preview Qwen 14B
llama_model_loader: - kv 3: general.basename str = cogito-v1-preview-qwen
llama_model_loader: - kv 4: general.size_label str = 14B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.base_model.count u32 = 1
llama_model_loader: - kv 7: general.base_model.0.name str = Qwen2.5 14B
llama_model_loader: - kv 8: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 9: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-14B
llama_model_loader: - kv 10: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 11: qwen2.block_count u32 = 48
llama_model_loader: - kv 12: qwen2.context_length u32 = 131072
llama_model_loader: - kv 13: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 14: qwen2.feed_forward_length u32 = 13824
llama_model_loader: - kv 15: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 16: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 20: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,151665] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,151665] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-07-12T13:15:09.403+01:00 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.60788413 runner.size="6.5 GiB" runner.vram="6.5 GiB" runner.parallel=2 runner.pid=124 runner.model=/home/aarvi/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29
llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 27: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 28: tokenizer.chat_template str = {%- if not enable_thinking is defined...
llama_model_loader: - kv 29: general.quantization_version u32 = 2
llama_model_loader: - kv 30: general.file_type u32 = 15
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q4_K: 289 tensors
llama_model_loader: - type q6_K: 49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 8.36 GiB (4.86 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 5120
print_info: n_layer = 48
print_info: n_head = 40
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 5
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 13824
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 14B
print_info: model params = 14.77 B
print_info: general.name = Cogito v1 Preview Qwen 14B
print_info: vocab type = BPE
print_info: n_vocab = 151665
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
time=2025-07-12T13:15:09.600+01:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
load_tensors: CPU_Mapped model buffer size = 8563.35 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.60 MiB
llama_kv_cache_unified: kv_size = 16384, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1, padding = 32
llama_kv_cache_unified: CPU KV buffer size = 3072.00 MiB
llama_kv_cache_unified: KV self size = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_context: CPU compute buffer size = 1352.01 MiB
llama_context: graph nodes = 1782
llama_context: graph splits = 1
time=2025-07-12T13:15:10.605+01:00 level=INFO source=server.go:637 msg="llama runner started in 1.26 seconds"
(python3:2): Gtk-CRITICAL **: 13:15:11.267: gtk_text_attributes_ref: assertion 'values != NULL' failed
aarvi@fedora:~$
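The lines that stand out to me are the ones above where Ollama reports the CUDA device but with driver=0.0 and warns "old CUDA driver detected - please upgrade to a newer driver", which I read as the CUDA userspace inside the sandbox not being able to talk to the host driver properly. A possible check from inside the sandbox (assuming the runtime ships nvidia-smi; if the command is simply not found, that in itself suggests the NVIDIA userspace isn't visible there):
flatpak run --command=sh com.jeffser.Alpaca -c nvidia-smi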
Is this - possibly - a duplicate of #739?
My apologies. I did try looking, but I didn't see that thread. The takeaway is that this is fixed and will be pushed to production in the next update? Anyway, thanks, I guess this can be closed as a duplicate.
Maybe I'm missing something, so it'd be good to have a third pair of eyes look at this. Could be unrelated, but it doesn't look like it. I think it's fine to leave this issue open until someone else voices an opinion.
Hi, the fix was pushed in the last version, could you check if it works?
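To make sure you're actually on the new build, something like this should pull the latest release from Flathub and then show the installed version:
flatpak update com.jeffser.Alpaca
flatpak info com.jeffser.Alpaca | grep -i version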
Is this the version that was released on Flathub a couple of days ago? It didn't use my GPU in initial testing, but maybe I need to do something extra, or reinstall the app or something.
Yes, it still does not use the GPU in the latest version, even after a complete reinstall.
You probably need to remove all the env variables from the instance, as was done here:
https://github.com/Jeffser/Alpaca/issues/739
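Going by the server config line in the debugging log above, the GPU-related variables currently set on the instance appear to be the following, so those are presumably the ones to clear in the instance settings:
CUDA_VISIBLE_DEVICES=0
HIP_VISIBLE_DEVICES=1
ROCR_VISIBLE_DEVICES=1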
After doing that, it does occasionally use the GPU, but more often it just crashes or gives this error message:
'NoneType' object has no attribute 'append_content'
Yeah, that's an unrelated issue that should be fixed with the next update, which is being uploaded to Flathub right now.
Once you have 7.5.3, could you check if everything works and close the issue if it does? Thanks!
Unfortunately, after updating it still crashes, and when restarting it comes back with a new error:
HTTPConnectionPool(host='0.0.0.0', port=11434): Max retries exceeded with url: /api/tags (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9f5ffdb470>: Failed to establish a new connection: [Errno 111] Connection refused'))
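In case it helps narrow things down: the debugging log from my original report shows Alpaca's own instance listening on 127.0.0.1:11435, while this error is about port 11434 on 0.0.0.0 (the default Ollama port). Hitting the API directly shows whether anything is actually listening on either one:
curl http://127.0.0.1:11435/api/tags
curl http://127.0.0.1:11434/api/tags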
Ok, it seems to have stopped crashing for now. If that behaviour persists, I will close this issue.
Update: it still crashes whenever I regenerate an answer. Otherwise it is fairly stable.