Ruonan Wang

Results 100 comments of Ruonan Wang

Hi @brownplayer, what is your OS and what is your target model? FlashMoE only supports Linux for now.

We may consider supporting FlashMoE on Windows later. But for now, I guess you could run qwen3-30b-3b on Windows with ipex-llm llama.cpp (either via pip install or the portable zip), you...

Hi @shailesh837, gemma3n is supported starting from `ipex-llm[cpp]==2.3.0b20250630`. You could try it first with `pip install --pre --upgrade ipex-llm[cpp]`, or wait for the new ollama portable zip; we will release it...
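For reference, a minimal sketch of upgrading and then confirming the installed build (the `pip show` check is just one way to verify the version and is not from the original comment):

```bash
# Upgrade to a nightly build that includes gemma3n support
pip install --pre --upgrade ipex-llm[cpp]
# Confirm the installed build is 2.3.0b20250630 or newer
pip show ipex-llm
```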

Hi @stereomato, I don't quite understand your problem. Could you please provide us with a detailed running log and the messages you mentioned?

Hi @shailesh837, the new portable zip is uploaded here: https://github.com/ipex-llm/ipex-llm/releases/download/v2.3.0-nightly/ollama-ipex-llm-2.3.0b20250630-win.zip . With this new portable zip, you can run gemma3n.

Hi @stereomato, we do not have support for `OLLAMA_FLASH_ATTENTION` yet.

Hi @stereomato, if you want to use an fp8 quantized kv cache, you could try `export IPEX_LLM_QUANTIZE_KV_CACHE=1` before `./ollama serve`. It might work for models running with llamarunner.
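For clarity, a minimal sketch of that sequence (it simply combines the two commands above; whether quantization actually takes effect still depends on the model running with llamarunner):

```bash
# Enable fp8 quantized kv cache for the ollama server started from this shell
export IPEX_LLM_QUANTIZE_KV_CACHE=1
# Start the ipex-llm ollama server as usual
./ollama serve
```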

Take `./ollama run qwen3` for example, the original kv cache output looks like:

```bash
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1, padding =...
```

Yeah, I just took qwen3 as an example. Actually, for such models with grouped-query attention, quantized kv cache does not bring an obvious benefit.

> What is the more recommended version, the nightly release or the docker container? I tried `gemma3n` with the latest docker container and it doesn't work

Hi @FilipLaurentiu, ...