
flash attention enabled but not supported by gpu

ghost opened this issue on Jun 30, 2025 · 8 comments

Describe the bug As the title says: flash attention is reported as enabled but not supported by the GPU.

How to reproduce Steps to reproduce the error (sketched as commands below):

  1. launch ipex-llm's ollama
  2. run a model (in this case, unsloth deepseek)
  3. get that message in the logs
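
For reference, the steps above as commands (a sketch; the env var and the model tag are assumptions taken from the comments further down):

export OLLAMA_FLASH_ATTENTION=1   # the setting that triggers the warning (see below)
./ollama serve                    # launch ipex-llm's ollama
./ollama run deepseek-r1:8b       # in a second shell; running any model surfaces the message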

Environment information

-----------------------------------------------------------------
PYTHON_VERSION=3.11.13
-----------------------------------------------------------------
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
transformers=4.36.2
-----------------------------------------------------------------
torch=2.2.0+cu121
-----------------------------------------------------------------
ipex-llm Version: 2.3.0b20250629
-----------------------------------------------------------------
IPEX is not installed. 
-----------------------------------------------------------------
CPU Information: 
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           39 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               GenuineIntel
Model name:                              12th Gen Intel(R) Core(TM) i5-12500H
CPU family:                              6
Model:                                   154
Thread(s) per core:                      2
Core(s) per socket:                      12
Socket(s):                               1
Stepping:                                3
CPU max MHz:                             4500.0000
CPU min MHz:                             400.0000
BogoMIPS:                                6220.80
-----------------------------------------------------------------
Total CPU Memory: 23.1682 GB
Memory Type: sudo: dmidecode: command not found
-----------------------------------------------------------------
Operating System: 
Ubuntu 22.04.5 LTS

-----------------------------------------------------------------
Linux 05a8037651d0 6.15.3 #1-NixOS SMP PREEMPT_DYNAMIC Thu Jun 19 13:41:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
/mnt/env-check.sh: line 148: xpu-smi: command not found
-----------------------------------------------------------------
/mnt/env-check.sh: line 154: clinfo: command not found
-----------------------------------------------------------------
Driver related package version:
ii  intel-level-zero-gpu                             1.6.32224.5                             amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  intel-level-zero-gpu-legacy1                     1.3.30872.22                            amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero-devel                                 1.20.2                                  amd64        oneAPI Level Zero
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is not installed. Please install xpu-smi according to README.md

Hmmm, weird. The iGPU does get used, though.

ghost avatar Jun 30 '25 23:06 ghost

Hi @stereomato, I don't quite understand your problem. Could you please provide us with a detailed running log and the messages you mentioned?

rnwang04 avatar Jul 01 '25 02:07 rnwang04

I enabled OLLAMA_FLASH_ATTENTION because I want to use a quantized (q8) KV cache to speed things up, but I got that warning. The laptop has Iris Xe Graphics (Alder Lake).

ghost avatar Jul 01 '25 02:07 ghost

Hi @stereomato, we do not have support for OLLAMA_FLASH_ATTENTION yet.

rnwang04 avatar Jul 01 '25 04:07 rnwang04

Hi @stereomato, if you want to use an 8-bit quantized KV cache, you could try export IPEX_LLM_QUANTIZE_KV_CACHE=1 before ./ollama serve. It might work for models running with the llamarunner.
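
A minimal sketch of that sequence (assuming the shell is already in the ipex-llm ollama directory):

export IPEX_LLM_QUANTIZE_KV_CACHE=1   # must be set in the same shell that starts the server
./ollama serve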

rnwang04 avatar Jul 01 '25 05:07 rnwang04

How do I know if it works?

ghost avatar Jul 01 '25 05:07 ghost

Take ./ollama run qwen3 for example; the original KV cache output looks like:

llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1, padding = 32
llama_kv_cache_unified:      SYCL0 KV buffer size =   576.00 MiB
llama_kv_cache_unified: KV self size  =  576.00 MiB, K (f16):  288.00 MiB, V (f16):  288.00 MiB

If export IPEX_LLM_QUANTIZE_KV_CACHE=1 works, the output looks like:

llama_kv_cache_unified: kv_size = 4096, type_k = 'i8', type_v = 'i8', n_layer = 36, can_shift = 1, padding = 32
llama_kv_cache_unified:      SYCL0 KV buffer size =   288.00 MiB
llama_kv_cache_unified: KV self size  =  288.00 MiB, K (i8):  144.00 MiB, V (i8):  144.00 MiB
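
One quick way to check is to filter the serve log for these lines (a sketch; it assumes the log goes to the terminal of the serve process):

./ollama serve 2>&1 | grep llama_kv_cache_unified

If type_k / type_v show 'i8' and the KV buffer size is halved, the quantized cache is active.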

rnwang04 avatar Jul 01 '25 06:07 rnwang04

Alright, I do get that output, but I also get a warning saying "This model is not recommended to use quantize kv cache!" with unsloth qwen3 (4B, q4_k_m) and unsloth deepseek r1 0528 (8B, q4_k_m)

ghost avatar Jul 01 '25 14:07 ghost

Yeah, I just took qwen3 as an example. Actually, for models like these with grouped-query attention, a quantized KV cache does not bring an obvious benefit.
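
For a sense of scale, the 576 MiB in the log above matches the usual KV-cache formula; the head counts below are assumptions from the published Qwen3-4B config, not something printed in the log:

2 (K and V) x 36 layers x 4096 tokens x 8 KV heads x 128 head dim x 2 bytes (f16) = 576 MiB
the same at 1 byte per element (i8)                                               = 288 MiB

Grouped-query attention already cuts the cache down (8 KV heads versus 32 query heads here), so it is small relative to the model weights and halving it again saves comparatively little.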

rnwang04 avatar Jul 02 '25 01:07 rnwang04