GPT-OSS 120B ERROR llama.cpp ipex-llm==2.3.0b20251104
Hello, the latest ipex-llm (ipex-llm==2.3.0b20251104) has an issue with the gpt-oss 120B and gpt-oss 20B models.
ENV
(ipex-llm) arc@xpu:~/llm/ipex-llm/llama-cpp$ uv pip install --pre --upgrade ipex-llm[cpp]
Using Python 3.11.14 environment at: /home/arc/miniconda3/envs/ipex-llm
Resolved 45 packages in 1.12s
Prepared 9 packages in 1.84s
Uninstalled 9 packages in 244ms
Installed 9 packages in 218ms
- bigdl-core-cpp==2.7.0b20251022
+ bigdl-core-cpp==2.7.0b20251104
- fsspec==2025.9.0
+ fsspec==2025.10.0
- hf-xet==1.1.11b1
+ hf-xet==1.2.0
- huggingface-hub==0.36.0rc0
+ huggingface-hub==0.36.0
- ipex-llm==2.3.0b20251022
+ ipex-llm==2.3.0b20251104
- networkx==3.5
+ networkx==3.6rc0
- psutil==7.1.1
+ psutil==7.1.3
- regex==2025.10.23
+ regex==2025.11.3
- safetensors==0.6.2
+ safetensors==0.7.0rc0
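For context, the llama-cpp working directory used below was presumably set up with ipex-llm's init-llama-cpp helper; a minimal sketch of that setup, assuming the documented ipex-llm[cpp] quickstart flow and a standard oneAPI install location:
# Sketch of the usual ipex-llm[cpp] setup on Arc (adjust paths for your system)
mkdir -p ~/llm/ipex-llm/llama-cpp
cd ~/llm/ipex-llm/llama-cpp
init-llama-cpp                        # symlink the bundled llama.cpp binaries into this directory
source /opt/intel/oneapi/setvars.sh   # load the oneAPI/SYCL runtime environment
export SYCL_CACHE_PERSISTENT=1        # recommended for Intel Arc GPUs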
llama-server
tensor 'blk.0.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
(ipex-llm) arc@xpu:~/llm/ipex-llm/llama-cpp$ ./llama-server -m ~/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf -ngl 99 -c 8192 --jinja
build: 1 (98abe88) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
system info: n_threads = 36, n_threads_batch = 36, total_threads = 72
system_info: n_threads = 36 (n_threads_batch = 36) / 72 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 71
main: loading model
srv load_model: loading model '/home/arc/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf'
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_load_from_file_impl: using device SYCL1 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
gguf_init_from_file_impl: tensor 'blk.0.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
gguf_init_from_file_impl: failed to read tensor info
llama_model_load: error loading model: llama_model_loader: failed to load model from /home/arc/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/home/arc/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf'
srv load_model: failed to load model, '/home/arc/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf'
srv operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
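For anyone else hitting this: ggml type 39 is MXFP4, the quantization the gpt-oss GGUFs use for their expert tensors, so the bundled bigdl-core-cpp build of llama.cpp apparently predates MXFP4 support and reports the type as NONE. A quick way to confirm what the file actually contains, as a sketch assuming the gguf Python package from upstream llama.cpp is installed (the model path is the one from the log above):
pip install gguf                      # gguf-py tooling from upstream llama.cpp
gguf-dump ~/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf | grep ffn_down_exps
# the ffn_*_exps tensors should be listed as MXFP4, which the bundled build does not recognize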
I'm facing the same issue.
Hello SONG Ge (@sgwhat), is there any information on when these models might appear in the supported models list?
Not sure. I haven't been working on it recently.
The fix was added to an August release of ollama, so I'm guessing we have to wait for the now seemingly defunct Intel team to roll the ollama fix into IPEX. I suspect this project is on the back burner, or they fired the talented staff who were working on it.
https://github.com/ollama/ollama/releases/tag/v0.11.2
@kodiconnect unfortunately, yes. This is the end of ipex-llm. I get 60 t/s on a single A770 with llama.cpp (Vulkan), and flash attention (FA) is working in b7063.
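For reference, a minimal sketch of that upstream Vulkan setup, assuming a working Vulkan driver/SDK; the model path and server flags are taken from the original report, and exact cmake options can vary between llama.cpp versions:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON       # build the Vulkan backend
cmake --build build --config Release -j
./build/bin/llama-server -m ~/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf -ngl 99 -c 8192 --jinja
On recent builds, flash attention can also be toggled with the -fa/--flash-attn option.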
Go to the OpenArc community; there is a SYCL maintainer there.