Ollama Linux seg fault with GPU on Ubuntu 22.04
Ran into seg faults trying to run Ollama on Ubuntu 22.04. This is with an Intel Arc A750 card.
A few searches turned up a similar fingerprint in https://github.com/intel/compute-runtime/issues/710. Reporting it here because the same workaround, i.e.:
export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48
does the trick to get things working. I did a fresh
pip install --pre --upgrade ipex-llm[cpp]
just today and am still not getting a version with this fixed, so maybe it's coming soon, but I wanted to raise the failure fingerprint and the workaround.
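For reference, the upgrade-and-check sequence I'm using is simply the following (a sketch; the quotes around the [cpp] extra are only there to be safe in shells that glob square brackets):
pip show ipex-llm                               # confirm which pre-release build is actually installed
pip install --pre --upgrade "ipex-llm[cpp]"     # pull the latest pre-release with the llama.cpp/Ollama bits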
Hi @doucej, could you please provide more information from the Ollama server side (e.g. the ollama server log)? This would help us in reproducing and fixing the issue.
Also, would you mind running the environment-check script at https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts and replying with its output?
Sure -- below is the failure log. The server was started via:
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export OLLAMA_HOST=0.0.0.0:11434
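(The server itself was then launched from the directory where ipex-llm's init-ollama created the symlinks -- a sketch, since the exact directory is specific to my setup:)
cd ~/ollama-ipex   # hypothetical: wherever init-ollama was run
./ollama serve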
2024/06/11 10:46:30 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]" time=2024-06-11T10:46:30.108-04:00 level=INFO source=images.go:729 msg="total blobs: 16" time=2024-06-11T10:46:30.108-04:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env: export GIN_MODE=release
- using code: gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers) [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers) [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers) [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers) [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers) [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) time=2024-06-11T10:46:30.109-04:00 level=INFO source=routes.go:1074 msg="Listening on [::]:11434 (version 0.0.0)" time=2024-06-11T10:46:30.109-04:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3171358111/runners time=2024-06-11T10:46:30.185-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]" time=2024-06-11T10:46:30.197-04:00 level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="39.1 GiB" available="21.5 GiB" [GIN] 2024/06/11 - 10:46:43 | 200 | 89.1µs | 127.0.0.1 | HEAD "/" [GIN] 2024/06/11 - 10:46:43 | 200 | 1.165584ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/06/11 - 10:46:43 | 200 | 754.647µs | 127.0.0.1 | POST "/api/show" time=2024-06-11T10:46:46.057-04:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="21.5 GiB" memory.required.full="4.6 GiB" memory.required.partial="4.6 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB" time=2024-06-11T10:46:46.058-04:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama3171358111/runners/cpu_avx2/ollama_llama_server --model /home/doucej/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable 
--n-gpu-layers 999 --parallel 1 --port 38351" time=2024-06-11T10:46:46.059-04:00 level=INFO source=sched.go:338 msg="loaded runners" count=1 time=2024-06-11T10:46:46.059-04:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding" time=2024-06-11T10:46:46.059-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=1 commit="1e71e4c" tid="127747830650880" timestamp=1718117206 INFO [main] system info | n_threads=22 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="127747830650880" timestamp=1718117206 total_threads=44 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="43" port="38351" tid="127747830650880" timestamp=1718117206 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/doucej/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-06-11T10:46:46.311-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens definition check successful ( 256/128256 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' [SYCL] call ggml_init_sycl ggml_init_sycl: GGML_SYCL_DEBUG: 0 ggml_init_sycl: GGML_SYCL_F16: no found 3 SYCL devices: | | | | |Max | |Max |Global | | | | | | |compute|Max work|sub |mem | |
| ID | Device Type | Name | Version | Max compute units | Max work group | Sub group | Global mem size | Driver version |
|---|---|---|---|---|---|---|---|---|
| 0 | [opencl:gpu:0] | Intel Arc A750 Graphics | 3.0 | 448 | 1024 | 32 | 8096M | 23.43.027642 |
| 1 | [opencl:cpu:0] | Intel Xeon CPU E5-2696 v4 @ 2.20GHz | 3.0 | 44 | 8192 | 64 | 41964M | 2024.17.3.0.08_160000 |
| 2 | [opencl:acc:0] | Intel FPGA Emulation Device | 1.2 | 44 | 67108864 | 64 | 41964M | 2024.17.3.0.08_160000 |
ggml_backend_sycl_set_mul_device_mode: true
llama_model_load: error loading model: DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'sycl::_V1::invalid_parameter_error'
what(): DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
time=2024-06-11T10:46:47.516-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
time=2024-06-11T10:46:48.295-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-06-11T10:46:48.546-04:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "
[GIN] 2024/06/11 - 10:46:48 | 500 | 5.338034878s | 127.0.0.1 | POST "/api/chat"
I connected to the server through "ollama run llama3".
Adding:
export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48
I get:
2024/06/11 10:49:51 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]" time=2024-06-11T10:49:51.435-04:00 level=INFO source=images.go:729 msg="total blobs: 16" time=2024-06-11T10:49:51.436-04:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env: export GIN_MODE=release
- using code: gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers) [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers) [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers) [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers) [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers) [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) time=2024-06-11T10:49:51.436-04:00 level=INFO source=routes.go:1074 msg="Listening on [::]:11434 (version 0.0.0)" time=2024-06-11T10:49:51.437-04:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2342145819/runners time=2024-06-11T10:49:51.529-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]" time=2024-06-11T10:49:51.546-04:00 level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="39.1 GiB" available="21.4 GiB" [GIN] 2024/06/11 - 10:50:00 | 200 | 71.283µs | 127.0.0.1 | HEAD "/" [GIN] 2024/06/11 - 10:50:00 | 200 | 1.271359ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/06/11 - 10:50:00 | 200 | 984.838µs | 127.0.0.1 | POST "/api/show" time=2024-06-11T10:50:03.871-04:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="21.4 GiB" memory.required.full="4.6 GiB" memory.required.partial="4.6 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB" time=2024-06-11T10:50:03.871-04:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama2342145819/runners/cpu_avx2/ollama_llama_server --model /home/doucej/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable 
--n-gpu-layers 999 --parallel 1 --port 37773" time=2024-06-11T10:50:03.872-04:00 level=INFO source=sched.go:338 msg="loaded runners" count=1 time=2024-06-11T10:50:03.872-04:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding" time=2024-06-11T10:50:03.873-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=1 commit="1e71e4c" tid="138128694257664" timestamp=1718117403 INFO [main] system info | n_threads=22 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="138128694257664" timestamp=1718117403 total_threads=44 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="43" port="37773" tid="138128694257664" timestamp=1718117403 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/doucej/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-06-11T10:50:04.124-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens definition check successful ( 256/128256 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' [SYCL] call ggml_init_sycl ggml_init_sycl: GGML_SYCL_DEBUG: 0 ggml_init_sycl: GGML_SYCL_F16: no found 4 SYCL devices: | | | | |Max | |Max |Global | | | | | | |compute|Max work|sub |mem | |
| ID | Device Type | Name | Version | Max compute units | Max work group | Sub group | Global mem size | Driver version |
|---|---|---|---|---|---|---|---|---|
| 0 | [level_zero:gpu:0] | Intel Arc A750 Graphics | 1.3 | 448 | 1024 | 32 | 8096M | 1.3.26241 |
| 1 | [opencl:gpu:0] | Intel Arc A750 Graphics | 3.0 | 448 | 1024 | 32 | 8096M | 23.43.027642 |
| 2 | [opencl:cpu:0] | Intel Xeon CPU E5-2696 v4 @ 2.20GHz | 3.0 | 44 | 8192 | 64 | 41964M | 2024.17.3.0.08_160000 |
| 3 | [opencl:acc:0] | Intel FPGA Emulation Device | 1.2 | 44 | 67108864 | 64 | 41964M | 2024.17.3.0.08_160000 |
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:448
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 4155.99 MiB
llm_load_tensors: CPU buffer size = 281.81 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.50 MiB
[1718117411] warming up the model with an empty run
llama_new_context_with_model: SYCL0 compute buffer size = 258.50 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1062
llama_new_context_with_model: graph splits = 2
INFO [main] model loaded | tid="138128694257664" timestamp=1718117412
time=2024-06-11T10:50:12.928-04:00 level=INFO source=server.go:571 msg="llama runner started in 9.06 seconds"
[GIN] 2024/06/11 - 10:50:12 | 200 | 11.976264811s | 127.0.0.1 | POST "/api/chat"
Ollama is up and correctly responding to requests as of this point.
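For a quick smoke test against the running server, a minimal request to the standard Ollama HTTP API looks like this (the model name assumes the llama3 tag already pulled above):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?"
}'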
hi @doucej, this issue is due to oneAPI not being installed correctly. You may get Ollama running with the following steps:
1. Please run sycl-ls to check your SYCL devices (there is also a quick one-liner check after this list). The expected output should look like:
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 5 125H OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.09.28717.12]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.28717]
If [ext_oneapi_level_zero:gpu] is not present, please proceed to step 2.
2. You may follow our guide to reinstall oneAPI: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id1.
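As a quick one-liner for the check in step 1 (plain grep over the same output, nothing ipex-llm-specific):
source /opt/intel/oneapi/setvars.sh
sycl-ls | grep -i level_zero || echo "no Level Zero GPU device found"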
Thanks -- yes, I do have my GPU listed:
(base) doucej@kryten:~$ source /opt/intel/oneapi/setvars.sh
:: initializing oneAPI environment ...
bash: BASH_VERSION = 5.2.21(1)-release
args: Using "$@" for setvars.sh arguments:
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
(base) doucej@kryten:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A750 Graphics OpenCL 3.0 NEO [23.43.027642]
(base) doucej@kryten:~$
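(For comparison on the driver side, a quick way to see which Intel compute driver packages are installed independently of the oneAPI toolkit -- just a dpkg grep; package names assume the usual Intel apt packaging:)
dpkg -l | grep -E 'intel-opencl-icd|intel-level-zero-gpu|level-zero'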
I did try creating a new conda environment today through:
conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade ipex-llm[cpp]
I then followed through with deleting the old ollama binary and re-running init-ollama (roughly as sketched below).
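(For reference, a sketch of that re-init step -- init-ollama comes from the ipex-llm[cpp] package, and the directory is arbitrary:)
cd ~/ollama-ipex   # hypothetical directory used for the Ollama symlinks
rm -f ollama       # drop the old symlink/binary
init-ollama        # recreate symlinks against the freshly installed ipex-llm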
That puts me back at the same point: I still need those extra environment variables for it to see and use the GPU without crashing.
When installing as above, do I also need to install/update the oneAPI packages through apt, or does the pip install take care of that for me? I can try that if this still looks like a setup/installation issue.
Thanks!
Hi @doucej, this failure of Ollama to run is because oneAPI has not been installed correctly. With a correct oneAPI installation, [ext_oneapi_level_zero:gpu] should be present in the output of sycl-ls, as shown in the expected output above.
To fix this issue, you may follow our guide to reinstall oneAPI: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id1.
Also, would you mind running the environment-check script at https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts and replying with its output?
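For reference, running the check script is just (a sketch, assuming a plain clone of the repo):
git clone https://github.com/intel-analytics/ipex-llm.git
cd ipex-llm/python/llm/scripts
bash env-check.sh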
Some notes below. I did go through and reinstall oneAPI; it looks like I indeed had an old version on there, most likely left over from the previous Ubuntu release.
Before touching anything, after activating the llm-cpp conda environment and sourcing /opt/intel/oneapi/setvars.sh:
(llm-cpp) doucej@kryten:~$ ~/Downloads/env-check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240620
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 44
On-line CPU(s) list: 0-43
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 22
Socket(s): 1
Stepping: 1
CPU(s) scaling MHz: 55%
CPU max MHz: 3700.0000
CPU min MHz: 1200.0000
-----------------------------------------------------------------
Total CPU Memory: 39.0826 GB
-----------------------------------------------------------------
Operating System:
Ubuntu 24.04 LTS \n \l
-----------------------------------------------------------------
Linux kryten 6.8.0-35-generic #35-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 15:51:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
/home/doucej/Downloads/env-check.sh: line 148: xpu-smi: command not found
-----------------------------------------------------------------
/home/doucej/Downloads/env-check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
rc intel-fw-gpu 2022.47.1+190 all Firmware package for Intel integrated and discrete GPUs
ii intel-level-zero-gpu 1.3.26241.33-647~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is not installed. Please install xpu-smi according to README.md
(llm-cpp) doucej@kryten:~$
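(Side note: the two "command not found" lines above are just optional tools the script calls. A sketch of installing them on Ubuntu -- clinfo is in the regular Ubuntu archive, while the xpu-smi package name assumes Intel's GPU apt repository is configured:)
sudo apt install clinfo    # generic OpenCL platform/device lister
sudo apt install xpu-smi   # Intel GPU telemetry tool, from Intel's graphics repo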
After running:
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-oneapi-common-vars=2024.0.0-49406 \
intel-oneapi-common-oneapi-vars=2024.0.0-49406 \
intel-oneapi-diagnostics-utility=2024.0.0-49093 \
intel-oneapi-compiler-dpcpp-cpp=2024.0.2-49895 \
intel-oneapi-dpcpp-ct=2024.0.0-49381 \
intel-oneapi-mkl=2024.0.0-49656 \
intel-oneapi-mkl-devel=2024.0.0-49656 \
intel-oneapi-mpi=2021.11.0-49493 \
intel-oneapi-mpi-devel=2021.11.0-49493 \
intel-oneapi-dal=2024.0.1-25 \
intel-oneapi-dal-devel=2024.0.1-25 \
intel-oneapi-ippcp=2021.9.1-5 \
intel-oneapi-ippcp-devel=2021.9.1-5 \
intel-oneapi-ipp=2021.10.1-13 \
intel-oneapi-ipp-devel=2021.10.1-13 \
intel-oneapi-tlt=2024.0.0-352 \
intel-oneapi-ccl=2021.11.2-5 \
intel-oneapi-ccl-devel=2021.11.2-5 \
intel-oneapi-dnnl-devel=2024.0.0-49521 \
intel-oneapi-dnnl=2024.0.0-49521 \
intel-oneapi-tcm-1.0=1.0.0-435
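To verify the reinstall, the relevant check is just re-sourcing the environment and listing SYCL devices again (a sketch; --force makes setvars.sh re-run even if it was already sourced in this shell):
source /opt/intel/oneapi/setvars.sh --force
sycl-ls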
I then let apt upgrade a number of intel-oneapi packages after the initial install, and rebooted for good measure:
(ollama) doucej@kryten:~$ ~/Downloads/env-check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240624
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 44
On-line CPU(s) list: 0-43
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 22
Socket(s): 1
Stepping: 1
CPU(s) scaling MHz: 55%
CPU max MHz: 3700.0000
CPU min MHz: 1200.0000
-----------------------------------------------------------------
Total CPU Memory: 39.0826 GB
-----------------------------------------------------------------
Operating System:
Ubuntu 24.04 LTS \n \l
-----------------------------------------------------------------
Linux kryten 6.8.0-35-generic #35-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 15:51:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
/home/doucej/Downloads/env-check.sh: line 148: xpu-smi: command not found
-----------------------------------------------------------------
/home/doucej/Downloads/env-check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
rc intel-fw-gpu 2022.47.1+190 all Firmware package for Intel integrated and discrete GPUs
ii intel-level-zero-gpu 1.3.26241.33-647~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is not installed. Please install xpu-smi according to README.md
(ollama) doucej@kryten:~$ sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A750 Graphics OpenCL 3.0 NEO [23.43.027642]
(ollama) doucej@kryten:~$
Now even the environment variables I was setting as a workaround before aren't helping; it just seg faults on startup. I'll take another look at my oneAPI install, as I'm guessing something is messed up there. It's odd that apt wanted to update a bunch of those packages immediately after installation; not sure if that's expected.
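(A sketch of the checks that would narrow this down: whether both the Level Zero loader and the GPU driver package are installed, and whether anything in the environment is filtering SYCL devices. Package names differ between Intel's repo and the Ubuntu archive, so treat these patterns as examples:)
dpkg -l | grep -E 'level-zero|libze|intel-opencl-icd'        # Level Zero loader + GPU driver packages
env | grep -iE 'ONEAPI_DEVICE_SELECTOR|SYCL_DEVICE' || true  # any SYCL device filters set?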
I realized the title was off -- I'm actually working with Ubuntu 24.04. I still haven't been able to get a good oneAPI installation even after a clean reinstall, but I was able to get Ollama running with Vulkan using the changes in https://github.com/ollama/ollama/pull/5059. So this may be operator error on my end; nonetheless, I'm up and running, and this issue is moot for me for now.