Model request: OpenAI's gpt-oss-20b model
I want to run it on an Intel GPU, but I got:
❯ ollama run gpt-oss
pulling manifest
Error: pull model manifest: 412:
The model you are attempting to pull requires a newer version of Ollama.
Please download the latest version at:
https://ollama.com/download
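For reference, a quick way to confirm which Ollama binary and version are actually being picked up (a sketch assuming a standard Linux install; ollama -v prints the installed version):
# confirm which binary is on PATH and which version it reports
which ollama
ollama -v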
(base) barry@barry-System:~/ollama$ ./ollama run gpt-oss
pulling manifest
pulling b112e727c6f1: 100% ▕██████████████████▏ 13 GB
pulling 51468a0fd901: 100% ▕██████████████████▏ 7.4 KB
pulling f60356777647: 100% ▕██████████████████▏ 11 KB
pulling d8ba2f9a17b3: 100% ▕██████████████████▏ 18 B
pulling 8d6fddaf04b2: 100% ▕██████████████████▏ 489 B
verifying sha256 digest
writing manifest
success
Error: template: :3: function "currentDate" not defined
(base) barry@barry-System:~/ollama$ ./ollama run gpt-oss
Error: template: :3: function "currentDate" not defined
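For what it's worth, the error points at the model's chat template rather than the weights: the template appears to call a currentDate function that this Ollama build does not register. One way to inspect the template (a sketch assuming the --template flag of ollama show, available in recent upstream Ollama):
# print the chat template shipped with the pulled model
./ollama show gpt-oss --template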
There is a problem with ollama-ipex-llm-2.3.0b20250725-ubuntu.tgz
I also tried and got the same issue. I think Ollama with IPEX-LLM needs to be updated to support the recent version of Ollama. See the related Ollama GitHub issue.
Same error. Even when I used Ollama v0.11.2.0 to download the gpt-oss:20b model into the same model folder used by ipex-llm, I still cannot run the model on Windows.
PS D:\ollama-ipex-llm-2.3.0b20250708-win> .\start-ollama.bat
PS D:\ollama-ipex-llm-2.3.0b20250708-win> .\ollama.exe list
NAME                                  ID            SIZE    MODIFIED
gpt-oss:20b                           f2b8351c629c  13 GB   2 minutes ago
deepseek-r1:14b-qwen-distill-q4_K_M   c333b7232bdb  9.0 GB  2 weeks ago
gemma3:12b-it-q4_K_M                  f4031aab637d  8.1 GB  2 weeks ago
deepseek-r1:8b-0528-qwen3-q4_K_M      6995872bfe4c  5.2 GB  2 weeks ago
PS D:\ollama-ipex-llm-2.3.0b20250708-win> .\ollama.exe run gpt-oss:20b
Error: template: :3: function "currentDate" not defined
We are currently working on supporting gpt-oss and will provide an update once it is complete.
Much appreciated. While you are doing that, please also check for compatibility with other popular tools like Langflow. I am getting errors with certain API requests:
[GIN] 2025/08/07 - 06:30:23 | 200 | 2.035664ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/08/07 - 06:30:23 | 500 | 1.825335ms | 127.0.0.1 | POST "/api/show"
[GIN] 2025/08/07 - 06:30:53 | 200 | 6.497414248s | 127.0.0.1 | POST "/api/chat"
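In case it helps with debugging, the 500 on /api/show can be reproduced outside Langflow with a plain HTTP request (a sketch assuming the default Ollama port 11434 and the documented /api/show payload):
# request the model's metadata directly; this is the call Langflow makes
curl http://localhost:11434/api/show -d '{"model": "gpt-oss:20b"}'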
TIA
Hey, any updates on this?
LM Studio does that very well - e.g. I see all my Intel Iris Xe Graphics VRAM working for gpt-oss:20b (at 13.4 tokens/sec) - but unfortunately it is closed source. Just to say that the feature is totally doable.
Hi all, I regret to inform you that support has been temporarily suspended. I apologize for any inconvenience this may cause.
Why exactly?
PURE SPECULATION: I mean, Intel's not doing too hot right now; they may not be putting much focus on open-source work at the moment.
I suppose the question is why the notice is posted here in this specific issue rather than in a more general and visible place like the README.
As I mentioned, I was able to run gpt-oss:20b entirely on an Intel GPU using LM Studio. If you want an open-source alternative, you can also build llama.cpp with ~~Intel oneAPI~~ (nope, maybe Vulkan).
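A minimal sketch of that alternative, assuming CMake and the Vulkan SDK are installed (the flags follow llama.cpp's documented Vulkan build; the GGUF path is a placeholder):
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# build with the Vulkan backend instead of SYCL
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# offload all layers to the Intel GPU (-ngl 99) and run a quick prompt
./build/bin/llama-cli -m /path/to/gpt-oss-20b.gguf -ngl 99 -p "Hello"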
AFAICS, the mxfp4 quantization format of this model is the problem. SYCL doesn't appear to support it:
tensor 'blk.0.ffn_gate_exps.weight' (mxfp4) (and 143 others) cannot be used with preferred buffer type SYCL0, using CPU instead
LM Studio offloads the entire model to Intel GPU because it uses the Vulkan backend.
I'm quite aware of this; as you say, llama.cpp and any fork or wrapper of it (koboldcpp, Ollama, LM Studio, etc.) use Vulkan to do this. Not only does the llama.cpp SYCL implementation lack MXFP4 support (or a fallback that emulates it, as it has for CUDA/Vulkan when the GPU doesn't support MXFP4 natively, which most don't at the time of writing); it also doesn't properly support MoE models on SYCL, causing them to be slower rather than faster in comparison to dense models.
I found your comment about MoE models being slower than dense ones on SYCL quite puzzling. On my dual A770 setup using Ollama with SYCL, Qwen3-30B-A3B is actually the fastest model I’ve run — consistently hitting over 40 tokens per second. That’s significantly better than any dense model of this size I’ve tested on the same hardware.
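For anyone wanting to compare numbers on their own hardware, Ollama can report these rates itself: running with --verbose prints prompt and eval token rates after each response (the model tag below is illustrative; use whichever Qwen3-30B-A3B tag you have pulled):
# prints "prompt eval rate" and "eval rate" in tokens/s after the reply
./ollama run qwen3:30b-a3b --verbose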
SYCL is about 20% slower than pure CPU on my system (Intel Core Ultra 155H). With IPEX, token generation was almost on par with pure-CPU llama.cpp (memory bound), but prompt processing was 3x faster, which is important in my application. I've tried to use Vulkan (Ubuntu 22.04), but I cannot load any "big" model because the driver throws an out-of-memory error for anything >4 GB (even though vulkaninfo shows the whole 96 GB is available; I think it's a driver limitation), which makes it unusable for most models.
IIRC the 4G thing is an Ubuntu 22 bug; I had the same problem before upgrading. Also, of course, make sure that Above 4G Decoding / Resizable BAR is enabled in the BIOS.
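As a quick sanity check on the allocation limit itself, independent of the BIOS setting, vulkaninfo reports the per-allocation limit alongside the heap sizes (the grep pattern is just a convenience):
# maxMemoryAllocationSize should be well above 4 GiB on a healthy driver
vulkaninfo | grep -i maxMemoryAllocation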