Model request: OpenAI's gpt-oss-20b model
I want to run it on an Intel GPU, but I got:
❯ ollama run gpt-oss
pulling manifest
Error: pull model manifest: 412:
The model you are attempting to pull requires a newer version of Ollama.
Please download the latest version at:
https://ollama.com/download
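For reference, a quick way to confirm which Ollama binary and version are actually being picked up (a sketch assuming a standard Linux install; ollama -v prints the installed version):
# confirm which binary is on PATH and which version it reports
which ollama
ollama -v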
(base) barry@barry-System:~/ollama$ ./ollama run gpt-oss
pulling manifest
pulling b112e727c6f1: 100% ▕██████████████████▏ 13 GB
pulling 51468a0fd901: 100% ▕██████████████████▏ 7.4 KB
pulling f60356777647: 100% ▕██████████████████▏ 11 KB
pulling d8ba2f9a17b3: 100% ▕██████████████████▏ 18 B
pulling 8d6fddaf04b2: 100% ▕██████████████████▏ 489 B
verifying sha256 digest
writing manifest
success
Error: template: :3: function "currentDate" not defined
(base) barry@barry-System:~/ollama$ ./ollama run gpt-oss
Error: template: :3: function "currentDate" not defined
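For what it's worth, the error points at the model's chat template rather than the weights: the template appears to call a currentDate function that this Ollama build does not register. One way to inspect the template (a sketch assuming the --template flag of ollama show, available in recent upstream Ollama):
# print the chat template shipped with the pulled model
./ollama show gpt-oss --template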
There is a problem with ollama-ipex-llm-2.3.0b20250725-ubuntu.tgz
I also tried and got the same issue. I think Ollama with IPEX-LLM needs to be updated to support the recent version of Ollama. See the related Ollama GitHub issue.
Same error. Even when I used Ollama v0.11.2.0 to download the gpt-oss:20b model into the same model folder used by ipex-llm, I still cannot run the model on Windows.
PS D:\ollama-ipex-llm-2.3.0b20250708-win> .\start-ollama.bat
PS D:\ollama-ipex-llm-2.3.0b20250708-win> .\ollama.exe list
NAME                                  ID            SIZE    MODIFIED
gpt-oss:20b                           f2b8351c629c  13 GB   2 minutes ago
deepseek-r1:14b-qwen-distill-q4_K_M   c333b7232bdb  9.0 GB  2 weeks ago
gemma3:12b-it-q4_K_M                  f4031aab637d  8.1 GB  2 weeks ago
deepseek-r1:8b-0528-qwen3-q4_K_M      6995872bfe4c  5.2 GB  2 weeks ago
PS D:\ollama-ipex-llm-2.3.0b20250708-win> .\ollama.exe run gpt-oss:20b
Error: template: :3: function "currentDate" not defined
We are currently working on supporting gpt-oss and will provide an update once it is complete.
Much appreciated. While you are doing that, please also check for compatibility with other popular tools like Langflow. I am getting errors with certain API requests:
[GIN] 2025/08/07 - 06:30:23 | 200 | 2.035664ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/08/07 - 06:30:23 | 500 | 1.825335ms | 127.0.0.1 | POST "/api/show"
[GIN] 2025/08/07 - 06:30:53 | 200 | 6.497414248s | 127.0.0.1 | POST "/api/chat"
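In case it helps with debugging, the 500 on /api/show can be reproduced outside Langflow with a plain HTTP request (a sketch assuming the default Ollama port 11434 and the documented /api/show payload):
# request the model's metadata directly; this is the call Langflow makes
curl http://localhost:11434/api/show -d '{"model": "gpt-oss:20b"}'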
TIA
Hey, any updates on this?
LM Studio does that very well - e.g. I see all my Intel Iris Xe Graphics VRAM working for gpt-oss:20b (at 13.4 tokens/sec) - but unfortunately it is closed source. Just to say that the feature is totally doable.
Hi all, I regret to inform you that support has been temporarily suspended. I apologize for any inconvenience this may cause.
Why exactly?
PURE SPECULATION: I mean, Intel's not doing too hot right now; they may not be putting much focus on open-source work at the moment.
I suppose the question is why the notice is posted here in this specific issue rather than in a more general and visible place like the README.
As I mentioned, I was able to run gpt-oss:20b entirely on an Intel GPU using LM Studio. If you want an open-source alternative, you can also build llama.cpp with ~~Intel oneAPI~~ (nope, maybe Vulkan).
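A minimal sketch of that alternative, assuming CMake and the Vulkan SDK are installed (the flags follow llama.cpp's documented Vulkan build; the GGUF path is a placeholder):
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# build with the Vulkan backend instead of SYCL
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# offload all layers to the Intel GPU (-ngl 99) and run a quick prompt
./build/bin/llama-cli -m /path/to/gpt-oss-20b.gguf -ngl 99 -p "Hello"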
AFAICS, the mxfp4 quantization format of this model is the problem. SYCL doesn't appear to support it:
tensor 'blk.0.ffn_gate_exps.weight' (mxfp4) (and 143 others) cannot be used with preferred buffer type SYCL0, using CPU instead
LM Studio offloads the entire model to Intel GPU because it uses the Vulkan backend.
I'm quite aware of this; as you say, llama.cpp and any fork or wrapper of it (koboldcpp, Ollama, LM Studio, etc.) use Vulkan to do this. Not only does the llama.cpp SYCL implementation lack MXFP4 support (or a fallback that emulates it, as it has for CUDA/Vulkan when the GPU doesn't support MXFP4 natively, which most don't at the time of writing); it also doesn't properly support MoE models on SYCL, causing them to be slower rather than faster in comparison to dense models.
I found your comment about MoE models being slower than dense ones on SYCL quite puzzling. On my dual A770 setup using Ollama with SYCL, Qwen3-30B-A3B is actually the fastest model I’ve run — consistently hitting over 40 tokens per second. That’s significantly better than any dense model of this size I’ve tested on the same hardware.
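For anyone wanting to compare numbers on their own hardware, Ollama can report these rates itself: running with --verbose prints prompt and eval token rates after each response (the model tag below is illustrative; use whichever Qwen3-30B-A3B tag you have pulled):
# prints "prompt eval rate" and "eval rate" in tokens/s after the reply
./ollama run qwen3:30b-a3b --verbose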
SYCL is about 20% slower than pure CPU on my system (Intel Core Ultra 155H). With IPEX, token generation was almost on par with pure-CPU llama.cpp (memory bound), but prompt processing was 3x faster, which is important in my application. I've tried to use Vulkan (Ubuntu 22.04), but I cannot load any "big" model because the driver throws an out-of-memory error for anything >4 GB (even though vulkaninfo shows the whole 96 GB is available; I think it's a driver limitation), which makes it unusable for most models.
IIRC the 4G thing is an Ubuntu 22 bug; I had the same problem before upgrading. Also, of course, make sure that Above 4G Decoding / Resizable BAR is enabled in the BIOS.
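As a quick sanity check on the allocation limit itself, independent of the BIOS setting, vulkaninfo reports the per-allocation limit alongside the heap sizes (the grep pattern is just a convenience):
# maxMemoryAllocationSize should be well above 4 GiB on a healthy driver
vulkaninfo | grep -i maxMemoryAllocation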