
Misc. bug: SYCL out of memory error

Open · BenPortner opened this issue 1 month ago · 20 comments

Name and Version

ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
version: 4404 (0827b2c1)
built with MSVC 19.42.34435.0

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

libllama (core library)

Problem description & steps to reproduce

Problem

I run into out-of-memory errors when using the SYCL backend. No error appears when running the same setup with the Vulkan backend (same model, prompt, context length, batch size, etc.). In the example below, the failing allocation is only about 568 MB, which is strange because I have 16 GB of GPU memory (shared system memory, not dedicated). The error is not specific to llama-cli: it also occurs when I use the Python bindings (llama-cpp-python; see the sketch below). It also occurs in earlier versions (I tried b4311).
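For reference, a minimal sketch of how the same failure surfaces through the Python bindings. This is a hypothetical repro, not the exact script I used; the model path is a placeholder and the constructor arguments simply mirror the llama-cli flags from the example below:

from llama_cpp import Llama

# mirrors -m, -ngl 99 and -c 40100 from the llama-cli invocation below
llm = Llama(
    model_path=r"C:\path\to\Llama-3.2-3B-Instruct-Q4_0.gguf",
    n_gpu_layers=99,
    n_ctx=40100,
)

# same very long prompt as prompt.txt
prompt = "bla " * 40000

# prompt processing fails with the same "can't allocate ... on device/GPU" error
out = llm(prompt, max_tokens=20)
print(out["choices"][0]["text"])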

Hardware

Dell Latitude 5420, Windows 10 Enterprise
CPU: 11th Gen Intel i7-1185G7 @ 3.00 GHz, 4 cores, 8 logical processors, x86_64
RAM: 2x 16 GB Hynix 3200 MHz DDR4 PC4-25600
GPU: Intel Iris Xe iGPU
Storage: Western Digital PC SN530 NVMe WDC 512 GB M.2 SSD

Minimal error example

rem create very long prompt
python -c "open('prompt.txt', 'w').write('bla ' * 40000)"

rem run llama-cli
llama-cli.exe -m "C:\path\to\Llama-3.2-3B-Instruct-Q4_0.gguf" --file prompt.txt -n 20 -ngl 99 -c 40100 --no-display-prompt

rem complete log attached
alloc: can't allocate 568118476 Bytes of memory on device/GPU
Enqueue process failed.
Exception caught at file:D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\ggml-sycl.cpp, line:3404, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *main_stream, oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\ggml-sycl.cpp:3404
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\..\ggml-sycl\common.hpp:111: SYCL error

First Bad Commit

No response

Relevant log output

C:\...\llama.cpp\b4404\sycl>llama-cli.exe -m "C:\path\to\Llama-3.2-3B-Instruct-Q4_0.gguf" --file prompt.txt -n 20 -ngl 99 -c 40100 --no-display-prompt
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
build: 4404 (0827b2c1) with MSVC 19.42.34435.0 for
main: llama backend init
main: load the model and apply lora adapter, if any
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) Iris(R) Xe Graphics) - 14658 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from C:\path\to\Llama-3.2-3B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 28
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 2
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_0:  193 tensors
llama_model_loader: - type q4_1:    3 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.78 GiB (4.77 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        SYCL0 model buffer size =  1825.40 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
.........................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 40128
llama_new_context_with_model: n_ctx_per_seq = 40128
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (40128) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|   12.0|     96|     512|   32| 15370M|            1.3.29803|
llama_kv_cache_init: kv_size = 40128, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28
llama_kv_cache_init:      SYCL0 KV buffer size =  4389.00 MiB
llama_new_context_with_model: KV self size  = 4389.00 MiB, K (f16): 2194.50 MiB, V (f16): 2194.50 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  1983.38 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    84.38 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 40128
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 3140513417
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 40128
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 40128, n_batch = 2048, n_predict = 20, n_keep = 1

alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
Enqueue process failed.
Exception caught at file:D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\ggml-sycl.cpp, line:3404, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *main_stream, oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\ggml-sycl.cpp:3404
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\..\ggml-sycl\common.hpp:111: SYCL error

BenPortner · Jan 02 '25 14:01