Deepseek 671B unable to run locally (Flatpak)
Hi,
I encountered the following error when trying to run DeepSeek 671B on my system.
user@fedora:~$ flatpak run com.jeffser.Alpaca
INFO    [main.py | main] Alpaca version: 4.0.0
INFO    [connection_handler.py | start] Starting Alpaca's Ollama instance...
INFO    [connection_handler.py | start] Started Alpaca's Ollama instance
INFO    [connection_handler.py | start] client version is 0.5.7
ERROR   [window.py | run_message] ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Exception in thread Thread-5 (run_message):
Traceback (most recent call last):
  File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 793, in urlopen
ERROR   [window.py | generate_chat_title] ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/urllib3/connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/usr/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/app/lib/python3.12/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
               ^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/urllib3/connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/usr/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/app/share/Alpaca/alpaca/window.py", line 670, in run_message
    response = self.ollama_instance.request("POST", "api/chat", json.dumps(data), lambda data, message_element=message_element: message_element.update_message(data))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/share/Alpaca/alpaca/connection_handler.py", line 82, in request
    response = requests.post(connection_url, headers=self.get_headers(True), data=data, stream=True)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/lib/python3.12/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/app/share/Alpaca/alpaca/window.py", line 675, in run_message
    raise Exception(e)
Exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
I am using the integrated Ollama instance, which was shown as Running. No changes or modifications were made in the Ollama Instance section.
System Specifications:
- GPU: 4090
- RAM: 768 GB
- OS: Fedora 41 GNOME
I tested with a smaller model (Qwen2 72B) and it had no issue generating a response, with no errors. This may be because it fits entirely into my 4090 (99% utilization) without spilling over to system RAM, whereas DeepSeek 671B cannot.
Is there a way to disable loading models into VRAM and load them into system RAM only, so I can test this?
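One way to test this, since the integrated instance is just Ollama listening on 127.0.0.1:11435 (see the debug log below), might be to send a request with the `num_gpu` option set to 0 so no layers are offloaded to the GPU. A rough sketch, not an Alpaca feature; the model tag is a placeholder you would replace with whatever you actually pulled:

```python
# Rough sketch (not an Alpaca setting): ask the integrated Ollama instance to
# keep every layer on the CPU by setting num_gpu to 0, so the model loads
# into system RAM only. MODEL is an assumption -- use the tag you pulled.
import requests

OLLAMA_URL = "http://127.0.0.1:11435"  # port of the integrated instance (see log below)
MODEL = "deepseek-r1:671b"             # assumption: replace with your actual tag

payload = {
    "model": MODEL,
    "prompt": "Reply with one short sentence.",
    "stream": False,
    "options": {"num_gpu": 0},         # 0 offloaded layers -> CPU / system RAM only
}

resp = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=3600)
resp.raise_for_status()
print(resp.json().get("response", ""))
```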
Sorry, I just noticed there was a debugger function. Please refer to the output below.
Couldn't find '/home/user/.ollama/id_ed25519'. Generating new private key.
Your new public key is:
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIObIEWaCEq49QSa3EgMEFudE9WqAhyBh9rfrPK6Zt/XX
2025/02/01 15:35:36 routes.go:1187: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11435 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/user/.var/app/com.jeffser.Alpaca/data/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-02-01T15:35:36.462+08:00 level=INFO source=images.go:432 msg="total blobs: 11"
time=2025-02-01T15:35:36.462+08:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env: export GIN_MODE=release
- using code: gin.SetMode(gin.ReleaseMode)
time=2025-02-01T15:35:36.462+08:00 level=INFO source=routes.go:1238 msg="Listening on 127.0.0.1:11435 (version 0.5.7)"
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-02-01T15:35:36.463+08:00 level=INFO source=routes.go:1267 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx]"
time=2025-02-01T15:35:36.463+08:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-02-01T15:35:36.798+08:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-aba95439-9f10-0dc7-c0e8-0c959db9b0a5 library=cuda variant=v11 compute=8.9 driver=0.0 name="" total="23.5 GiB" available="22.7 GiB"
[GIN] 2025/02/01 - 15:35:36 | 200 | 408.876µs | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/02/01 - 15:35:36 | 200 | 20.412271ms | 127.0.0.1 | POST "/api/show"
[GIN] 2025/02/01 - 15:35:36 | 200 | 21.982926ms | 127.0.0.1 | POST "/api/show"
time=2025-02-01T15:35:48.951+08:00 level=INFO source=server.go:104 msg="system memory" total="754.9 GiB" free="746.7 GiB" free_swap="8.0 GiB"
time=2025-02-01T15:35:48.951+08:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=62 layers.offload=5 layers.split="" memory.available="[22.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="415.7 GiB" memory.required.partial="17.6 GiB" memory.required.kv="9.5 GiB" memory.required.allocations="[17.6 GiB]" memory.weights.total="385.0 GiB" memory.weights.repeating="384.3 GiB" memory.weights.nonrepeating="725.0 MiB" memory.graph.full="654.0 MiB" memory.graph.partial="1019.5 MiB"
time=2025-02-01T15:35:48.952+08:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/app/lib/ollama/runners/cuda_v11_avx/ollama_llama_server runner --model /home/user/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 --ctx-size 2048 --batch-size 512 --n-gpu-layers 5 --threads 96 --parallel 1 --port 41315"
time=2025-02-01T15:35:48.960+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-01T15:35:48.960+08:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-02-01T15:35:48.960+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-02-01T15:35:48.991+08:00 level=INFO source=runner.go:936 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
time=2025-02-01T15:35:49.040+08:00 level=INFO source=runner.go:937 msg=system info="CUDA : USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=96
time=2025-02-01T15:35:49.040+08:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:41315"
time=2025-02-01T15:35:49.212+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 22986 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 1025 tensors from /home/user/.var/app/com.jeffser.Alpaca/data/.ollama/models/blobs/sha256-9801e7fce27dbf3d0bfb468b7b21f1d132131a546dfc43e50518631b8b1800a9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.size_label str = 256x20B
llama_model_loader: - kv 3: deepseek2.block_count u32 = 61
llama_model_loader: - kv 4: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 5: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 6: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 7: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 8: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 9: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 10: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 11: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 12: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 13: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 14: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 15: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 16: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 17: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 18: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 19: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 20: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 21: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 22: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 23: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 24: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 25: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 26: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 27: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 28: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 30: tokenizer.ggml.pre str = deepseek-v3
Can you tell me about the system configuration you have?
Sure.
System Specifications:
- CPU: 7995WX
- GPU: 4090
- RAM: 768 GB
- OS: Fedora 41 GNOME
I also tested with DeepSeek 70B, which is about 40 GB, and I can run it successfully: 20-22 GB sits in my VRAM and the remainder (~22 GB) overflows properly into system RAM.
This is with Llama3.3 70B, which is about 75 GB.
DeepSeek 671B is only about 400 GB, which should still be manageable within my RAM capacity.
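To double-check where the weights actually end up, Ollama's /api/ps endpoint reports, for each loaded model, its total size and how much of it is resident in VRAM. A small sketch against the integrated instance, using the port from the debug log:

```python
# Sketch: query the integrated Ollama instance for the models it currently has
# loaded and how much of each sits in VRAM vs. system RAM (/api/ps).
import requests

resp = requests.get("http://127.0.0.1:11435/api/ps", timeout=10)
resp.raise_for_status()

for m in resp.json().get("models", []):
    size = m.get("size", 0)        # total bytes occupied by the loaded model
    vram = m.get("size_vram", 0)   # bytes resident in GPU VRAM
    print(f"{m.get('name')}: total {size / 2**30:.1f} GiB, "
          f"VRAM {vram / 2**30:.1f} GiB, "
          f"system RAM {(size - vram) / 2**30:.1f} GiB")
```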
Have you tried with ollama directly?
Would interest me as well.
A feature to disable CPU fallback (GPU only) or to force CPU-only usage (globally or per model) would be handy. Occasionally the app partially loads a model into VRAM, then fails (the VRAM stays occupied until an app restart clears it) and switches to CPU, requiring repeated manual termination. I haven't tested the latest releases in that regard.
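As a possible stopgap, the stuck model can usually be unloaded without restarting the app by sending the integrated Ollama instance a request with keep_alive set to 0. A rough sketch; the model tag is a placeholder for whichever model is stuck:

```python
# Sketch: unload a stuck model (freeing its VRAM) without restarting the app
# by sending an empty request with keep_alive = 0. MODEL is an assumption.
import requests

OLLAMA_URL = "http://127.0.0.1:11435"  # integrated instance port from the log above
MODEL = "deepseek-r1:671b"             # assumption: replace with the stuck model's tag

resp = requests.post(f"{OLLAMA_URL}/api/generate",
                     json={"model": MODEL, "keep_alive": 0},
                     timeout=60)
resp.raise_for_status()
print("unload requested:", resp.json().get("done_reason", "done"))
```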
It seems that Alpaca is crashing, not Ollama. When Ollama hits an out-of-memory error (which usually causes the
Exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
or empty responses), it is written explicitly in the logs, whether you try to use GPU VRAM exclusively or GPU plus system RAM. Did you include the full logs? Nothing cut off at the end?
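One quick way to tell the two apart might be to probe the server right after the failure: if the version endpoint still answers, Ollama itself is up and the disconnect happened while handling the request. A minimal sketch against the integrated instance's port from the logs above:

```python
# Sketch: right after the error, check whether the Ollama server is still
# answering. If this succeeds, Ollama itself did not go down outright.
import requests

try:
    resp = requests.get("http://127.0.0.1:11435/api/version", timeout=5)
    resp.raise_for_status()
    print("Ollama is still up, version:", resp.json().get("version"))
except requests.RequestException as exc:
    print("Ollama server is not responding:", exc)
```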