
Ollama Linux seg fault with GPU on Ubuntu 22.04

Open • doucej opened this issue 1 year ago • 6 comments

Ran into seg faults trying to run Ollama on Ubuntu 22.04, with an Intel Arc A750 card.

A few searches turned up a similar fingerprint in https://github.com/intel/compute-runtime/issues/710. Reporting it here because the same workaround:

export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48

does the trick to get things working. I ran:

pip install --pre --upgrade ipex-llm[cpp]

just today and am still not getting a version that has this fixed, so maybe a fix is coming soon, but I wanted to raise the failure fingerprint and the workaround.
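
For reference, this is roughly the full launch sequence I'm using with the workaround (just a sketch; it assumes init-ollama from the ipex-llm[cpp] install has already created the ollama symlink in the current directory):

    # oneAPI environment plus GPU-related settings, then the compute-runtime workaround
    source /opt/intel/oneapi/setvars.sh
    export OLLAMA_NUM_GPU=999
    export ZES_ENABLE_SYSMAN=1
    export SYCL_CACHE_PERSISTENT=1
    # workaround from intel/compute-runtime issue 710
    export NEOReadDebugKeys=1
    export OverrideGpuAddressSpace=48
    ./ollama serve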

doucej avatar Jun 09 '24 16:06 doucej

Hi @doucej, could you please provide more information from the Ollama server side (for example, the Ollama server log)? That would help us track down and fix the issue.

Also, would you mind running the scripts at https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts to check your system environment and replying with the output?

sgwhat avatar Jun 11 '24 01:06 sgwhat

Sure -- below is the failure log. I start the server via:

export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1
export OLLAMA_HOST=0.0.0.0:11434

2024/06/11 10:46:30 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]" time=2024-06-11T10:46:30.108-04:00 level=INFO source=images.go:729 msg="total blobs: 16" time=2024-06-11T10:46:30.108-04:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

  • using env: export GIN_MODE=release
  • using code: gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers) [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers) [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers) [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers) [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers) [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) time=2024-06-11T10:46:30.109-04:00 level=INFO source=routes.go:1074 msg="Listening on [::]:11434 (version 0.0.0)" time=2024-06-11T10:46:30.109-04:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3171358111/runners time=2024-06-11T10:46:30.185-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]" time=2024-06-11T10:46:30.197-04:00 level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="39.1 GiB" available="21.5 GiB" [GIN] 2024/06/11 - 10:46:43 | 200 | 89.1µs | 127.0.0.1 | HEAD "/" [GIN] 2024/06/11 - 10:46:43 | 200 | 1.165584ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/06/11 - 10:46:43 | 200 | 754.647µs | 127.0.0.1 | POST "/api/show" time=2024-06-11T10:46:46.057-04:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="21.5 GiB" memory.required.full="4.6 GiB" memory.required.partial="4.6 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB" time=2024-06-11T10:46:46.058-04:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama3171358111/runners/cpu_avx2/ollama_llama_server --model /home/doucej/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable 
--n-gpu-layers 999 --parallel 1 --port 38351" time=2024-06-11T10:46:46.059-04:00 level=INFO source=sched.go:338 msg="loaded runners" count=1 time=2024-06-11T10:46:46.059-04:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding" time=2024-06-11T10:46:46.059-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=1 commit="1e71e4c" tid="127747830650880" timestamp=1718117206 INFO [main] system info | n_threads=22 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="127747830650880" timestamp=1718117206 total_threads=44 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="43" port="38351" tid="127747830650880" timestamp=1718117206 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/doucej/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-06-11T10:46:46.311-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens definition check successful ( 256/128256 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' [SYCL] call ggml_init_sycl ggml_init_sycl: GGML_SYCL_DEBUG: 0 ggml_init_sycl: GGML_SYCL_F16: no found 3 SYCL devices: | | | | |Max | |Max |Global | | | | | | |compute|Max work|sub |mem | |

ID | Device Type    | Name                                | Version | Max compute units | Max work group | Sub group | Global mem size | Driver version
 0 | [opencl:gpu:0] | Intel Arc A750 Graphics             | 3.0     | 448               | 1024           | 32        | 8096M           | 23.43.027642
 1 | [opencl:cpu:0] | Intel Xeon CPU E5-2696 v4 @ 2.20GHz | 3.0     | 44                | 8192            | 64        | 41964M          | 2024.17.3.0.08_160000
 2 | [opencl:acc:0] | Intel FPGA Emulation Device         | 1.2     | 44                | 67108864        | 64        | 41964M          | 2024.17.3.0.08_160000
ggml_backend_sycl_set_mul_device_mode: true
llama_model_load: error loading model: DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'sycl::_V1::invalid_parameter_error'
what(): DeviceList is empty. -30 (PI_ERROR_INVALID_VALUE)
time=2024-06-11T10:46:47.516-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server not responding"
time=2024-06-11T10:46:48.295-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
time=2024-06-11T10:46:48.546-04:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "
[GIN] 2024/06/11 - 10:46:48 | 500 | 5.338034878s | 127.0.0.1 | POST "/api/chat"

I connected to the server via "ollama run llama3".

After adding:

export NEOReadDebugKeys=1
export OverrideGpuAddressSpace=48

I get:

2024/06/11 10:49:51 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]" time=2024-06-11T10:49:51.435-04:00 level=INFO source=images.go:729 msg="total blobs: 16" time=2024-06-11T10:49:51.436-04:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.

  • using env: export GIN_MODE=release
  • using code: gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers) [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers) [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers) [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers) [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers) [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) time=2024-06-11T10:49:51.436-04:00 level=INFO source=routes.go:1074 msg="Listening on [::]:11434 (version 0.0.0)" time=2024-06-11T10:49:51.437-04:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2342145819/runners time=2024-06-11T10:49:51.529-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]" time=2024-06-11T10:49:51.546-04:00 level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="39.1 GiB" available="21.4 GiB" [GIN] 2024/06/11 - 10:50:00 | 200 | 71.283µs | 127.0.0.1 | HEAD "/" [GIN] 2024/06/11 - 10:50:00 | 200 | 1.271359ms | 127.0.0.1 | POST "/api/show" [GIN] 2024/06/11 - 10:50:00 | 200 | 984.838µs | 127.0.0.1 | POST "/api/show" time=2024-06-11T10:50:03.871-04:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=33 memory.available="21.4 GiB" memory.required.full="4.6 GiB" memory.required.partial="4.6 GiB" memory.required.kv="256.0 MiB" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB" time=2024-06-11T10:50:03.871-04:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama2342145819/runners/cpu_avx2/ollama_llama_server --model /home/doucej/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable 
--n-gpu-layers 999 --parallel 1 --port 37773" time=2024-06-11T10:50:03.872-04:00 level=INFO source=sched.go:338 msg="loaded runners" count=1 time=2024-06-11T10:50:03.872-04:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding" time=2024-06-11T10:50:03.873-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=1 commit="1e71e4c" tid="138128694257664" timestamp=1718117403 INFO [main] system info | n_threads=22 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="138128694257664" timestamp=1718117403 total_threads=44 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="43" port="37773" tid="138128694257664" timestamp=1718117403 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/doucej/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-06-11T10:50:04.124-04:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens definition check successful ( 256/128256 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' [SYCL] call ggml_init_sycl ggml_init_sycl: GGML_SYCL_DEBUG: 0 ggml_init_sycl: GGML_SYCL_F16: no found 4 SYCL devices: | | | | |Max | |Max |Global | | | | | | |compute|Max work|sub |mem | |

ID | Device Type        | Name                                | Version | Max compute units | Max work group | Sub group | Global mem size | Driver version
 0 | [level_zero:gpu:0] | Intel Arc A750 Graphics             | 1.3     | 448               | 1024            | 32        | 8096M           | 1.3.26241
 1 | [opencl:gpu:0]     | Intel Arc A750 Graphics             | 3.0     | 448               | 1024            | 32        | 8096M           | 23.43.027642
 2 | [opencl:cpu:0]     | Intel Xeon CPU E5-2696 v4 @ 2.20GHz | 3.0     | 44                | 8192             | 64        | 41964M          | 2024.17.3.0.08_160000
 3 | [opencl:acc:0]     | Intel FPGA Emulation Device         | 1.2     | 44                | 67108864         | 64        | 41964M          | 2024.17.3.0.08_160000
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:448
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 4155.99 MiB
llm_load_tensors: CPU buffer size = 281.81 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.50 MiB
[1718117411] warming up the model with an empty run
llama_new_context_with_model: SYCL0 compute buffer size = 258.50 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1062
llama_new_context_with_model: graph splits = 2
INFO [main] model loaded tid="138128694257664" timestamp=1718117412
time=2024-06-11T10:50:12.928-04:00 level=INFO source=server.go:571 msg="llama runner started in 9.06 seconds"
[GIN] 2024/06/11 - 10:50:12 | 200 | 11.976264811s | 127.0.0.1 | POST "/api/chat"

Ollama is up and correctly responding to requests as of this point.
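
As a quick smoke test from another shell (a sketch; the port comes from OLLAMA_HOST above, and llama3 is the model I'm running):

    # hits the /api/generate route listed in the GIN output above
    curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Hello", "stream": false}'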

doucej avatar Jun 11 '24 14:06 doucej

Hi @doucej, this issue is due to oneAPI not being installed correctly. You should be able to run Ollama after the following steps:

  1. Please run sycl-ls to check your SYCL devices. The expected output should look like this:

    [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  
    [2023.16.12.0.12_195853.xmain-hotfix]
    [opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 5 125H OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
    [opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO  [24.09.28717.12]
    [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.28717]
    

    If [ext_oneapi_level_zero:gpu] is not present, please proceed to step 2 (a quick check command is sketched after this list).

  2. You may follow our guide to reinstall oneAPI: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id1.
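
As a quick one-liner check (a minimal sketch, assuming oneAPI is installed under the default /opt/intel/oneapi prefix):

    # a Level Zero GPU entry should appear once oneAPI is set up correctly
    source /opt/intel/oneapi/setvars.sh
    sycl-ls | grep -i "level_zero:gpu" || echo "No Level Zero GPU found - please reinstall oneAPI (step 2)"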

sgwhat avatar Jun 12 '24 02:06 sgwhat

Thanks -- yes, I do have my GPU listed:

(base) doucej@kryten:~$ source /opt/intel/oneapi/setvars.sh
 
:: initializing oneAPI environment ...
   bash: BASH_VERSION = 5.2.21(1)-release
   args: Using "$@" for setvars.sh arguments: 
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: tbb -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
 
(base) doucej@kryten:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.3.0.08_160000]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A750 Graphics OpenCL 3.0 NEO  [23.43.027642]
(base) doucej@kryten:~$ 

I did try creating a new conda environment today with:

conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade ipex-llm[cpp]

I then deleted the old ollama binary and re-ran init-ollama.

That puts me back at the same point: I still need those extra environment variables for it to see and use the GPU without crashing.

When installing as above, do I also need to install/update the oneAPI packages through apt, or does the pip install above set that up for me? I could try that if this still looks like a setup/installation issue.
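
In case it is useful, this is roughly how I'm comparing what pip pulled into the conda environment against what apt has installed system-wide (a sketch; the grep patterns are just my guesses at relevant package names):

    # inside the llm-cpp conda env: oneAPI-related wheels pulled in by pip, if any
    pip list 2>/dev/null | grep -i -E "dpcpp|mkl|oneapi"
    # system-wide Intel GPU / oneAPI packages installed through apt
    dpkg -l | grep -i -E "intel-oneapi|intel-level-zero-gpu|intel-opencl-icd"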

Thanks!

doucej avatar Jun 21 '24 12:06 doucej

Hi @doucej, Ollama fails to run because oneAPI is not installed correctly. With a correct oneAPI installation, [ext_oneapi_level_zero:gpu] should appear in the output of sycl-ls.

To fix this, you may follow our guide to reinstall oneAPI: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id1.

Also, would you mind running the scripts at https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/scripts to check your system environment and replying with the output?

sgwhat avatar Jun 24 '24 03:06 sgwhat

Some notes below. I did reinstall oneAPI, and it looks like I indeed had an old version on there, most likely left over from the previous Ubuntu release.

Before touching anything, after activating the llm-cpp conda environment and sourcing /opt/intel/oneapi/setvars.sh:

(llm-cpp) doucej@kryten:~$ ~/Downloads/env-check.sh 
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240620
-----------------------------------------------------------------
IPEX is not installed. 
-----------------------------------------------------------------
CPU Information: 
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               44
On-line CPU(s) list:                  0-43
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
CPU family:                           6
Model:                                79
Thread(s) per core:                   2
Core(s) per socket:                   22
Socket(s):                            1
Stepping:                             1
CPU(s) scaling MHz:                   55%
CPU max MHz:                          3700.0000
CPU min MHz:                          1200.0000
-----------------------------------------------------------------
Total CPU Memory: 39.0826 GB
-----------------------------------------------------------------
Operating System: 
Ubuntu 24.04 LTS \n \l

-----------------------------------------------------------------
Linux kryten 6.8.0-35-generic #35-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 15:51:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
/home/doucej/Downloads/env-check.sh: line 148: xpu-smi: command not found
-----------------------------------------------------------------
/home/doucej/Downloads/env-check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
rc  intel-fw-gpu                                    2022.47.1+190                            all          Firmware package for Intel integrated and discrete GPUs
ii  intel-level-zero-gpu                            1.3.26241.33-647~22.04                   amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is not installed. Please install xpu-smi according to README.md
(llm-cpp) doucej@kryten:~$ 

Then I added the Intel apt repository and reinstalled the oneAPI packages:

wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null

echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list

sudo apt update

sudo apt install intel-oneapi-common-vars=2024.0.0-49406 \
  intel-oneapi-common-oneapi-vars=2024.0.0-49406 \
  intel-oneapi-diagnostics-utility=2024.0.0-49093 \
  intel-oneapi-compiler-dpcpp-cpp=2024.0.2-49895 \
  intel-oneapi-dpcpp-ct=2024.0.0-49381 \
  intel-oneapi-mkl=2024.0.0-49656 \
  intel-oneapi-mkl-devel=2024.0.0-49656 \
  intel-oneapi-mpi=2021.11.0-49493 \
  intel-oneapi-mpi-devel=2021.11.0-49493 \
  intel-oneapi-dal=2024.0.1-25 \
  intel-oneapi-dal-devel=2024.0.1-25 \
  intel-oneapi-ippcp=2021.9.1-5 \
  intel-oneapi-ippcp-devel=2021.9.1-5 \
  intel-oneapi-ipp=2021.10.1-13 \
  intel-oneapi-ipp-devel=2021.10.1-13 \
  intel-oneapi-tlt=2024.0.0-352 \
  intel-oneapi-ccl=2021.11.2-5 \
  intel-oneapi-ccl-devel=2021.11.2-5 \
  intel-oneapi-dnnl-devel=2024.0.0-49521 \
  intel-oneapi-dnnl=2024.0.0-49521 \
  intel-oneapi-tcm-1.0=1.0.0-435

I then let apt upgrade a number of intel-oneapi packages right after the initial install, and rebooted for good measure. After that:

(ollama) doucej@kryten:~$ ~/Downloads/env-check.sh 
-----------------------------------------------------------------
PYTHON_VERSION=3.11.9
-----------------------------------------------------------------
transformers=4.41.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.1.0b20240624
-----------------------------------------------------------------
IPEX is not installed. 
-----------------------------------------------------------------
CPU Information: 
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               44
On-line CPU(s) list:                  0-43
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz
CPU family:                           6
Model:                                79
Thread(s) per core:                   2
Core(s) per socket:                   22
Socket(s):                            1
Stepping:                             1
CPU(s) scaling MHz:                   55%
CPU max MHz:                          3700.0000
CPU min MHz:                          1200.0000
-----------------------------------------------------------------
Total CPU Memory: 39.0826 GB
-----------------------------------------------------------------
Operating System: 
Ubuntu 24.04 LTS \n \l

-----------------------------------------------------------------
Linux kryten 6.8.0-35-generic #35-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 15:51:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
/home/doucej/Downloads/env-check.sh: line 148: xpu-smi: command not found
-----------------------------------------------------------------
/home/doucej/Downloads/env-check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
rc  intel-fw-gpu                                    2022.47.1+190                            all          Firmware package for Intel integrated and discrete GPUs
ii  intel-level-zero-gpu                            1.3.26241.33-647~22.04                   amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is not installed. Please install xpu-smi according to README.md
(ollama) doucej@kryten:~$ sycl-ls
[opencl:cpu][opencl:0] Intel(R) OpenCL, Intel(R) Xeon(R) CPU E5-2696 v4 @ 2.20GHz OpenCL 3.0 (Build 0) [2024.18.6.0.02_160000]
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A750 Graphics OpenCL 3.0 NEO  [23.43.027642]
(ollama) doucej@kryten:~$ 

Now even the environment variables I was setting as a workaround before aren't helping; it just seg faults on startup. I'll take another look at my oneAPI install, since I'm guessing something is messed up there. It's odd that apt wanted to upgrade a bunch of those packages immediately after installation; I'm not sure whether that's expected.

doucej avatar Jun 24 '24 17:06 doucej

I realized the title was off; I'm actually working with Ubuntu 24.04. I've still not been able to get a good oneAPI installation even after a clean reinstall, but I was able to get Ollama running with Vulkan using the changes in https://github.com/ollama/ollama/pull/5059#. So this may be operator error on my end; nonetheless, I'm up and running, so this issue is moot for me for now.

doucej avatar Aug 04 '24 16:08 doucej