
GPT-OSS 120B ERROR llama.cpp ipex-llm==2.3.0b20251104

Open savvadesogle opened this issue 1 month ago • 5 comments

Hello, the latest ipex-llm (ipex-llm==2.3.0b20251104) has an issue with the models gpt-oss 120B and gpt-oss 20B.

ENV

(ipex-llm) arc@xpu:~/llm/ipex-llm/llama-cpp$ uv pip install --pre --upgrade ipex-llm[cpp]
Using Python 3.11.14 environment at: /home/arc/miniconda3/envs/ipex-llm
Resolved 45 packages in 1.12s
Prepared 9 packages in 1.84s
Uninstalled 9 packages in 244ms
Installed 9 packages in 218ms
 - bigdl-core-cpp==2.7.0b20251022
 + bigdl-core-cpp==2.7.0b20251104
 - fsspec==2025.9.0
 + fsspec==2025.10.0
 - hf-xet==1.1.11b1
 + hf-xet==1.2.0
 - huggingface-hub==0.36.0rc0
 + huggingface-hub==0.36.0
 - ipex-llm==2.3.0b20251022
 + ipex-llm==2.3.0b20251104
 - networkx==3.5
 + networkx==3.6rc0
 - psutil==7.1.1
 + psutil==7.1.3
 - regex==2025.10.23
 + regex==2025.11.3
 - safetensors==0.6.2
 + safetensors==0.7.0rc0

llama-server

tensor 'blk.0.ffn_down_exps.weight' has invalid ggml type 39 (NONE)

(ipex-llm) arc@xpu:~/llm/ipex-llm/llama-cpp$ ./llama-server -m ~/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf -ngl 99 -c 8192 --jinja
build: 1 (98abe88) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
system info: n_threads = 36, n_threads_batch = 36, total_threads = 72

system_info: n_threads = 36 (n_threads_batch = 36) / 72 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 71
main: loading model
srv    load_model: loading model '/home/arc/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf'
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_load_from_file_impl: using device SYCL1 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
gguf_init_from_file_impl: tensor 'blk.0.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
gguf_init_from_file_impl: failed to read tensor info
llama_model_load: error loading model: llama_model_loader: failed to load model from /home/arc/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/home/arc/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf'
srv    load_model: failed to load model, '/home/arc/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
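
For context: in current upstream ggml, enum value 39 is GGML_TYPE_MXFP4, the 4-bit microscaling format that gpt-oss GGUFs use for their expert FFN weights. A llama.cpp build that predates MXFP4 support does not recognize that value, reports it as NONE, and aborts the load before any SYCL code runs. Below is a minimal sketch (mine, not from ipex-llm) to confirm what the file actually contains, assuming a recent gguf-py (pip install gguf) that already knows about MXFP4:

# Count tensors per ggml quantization type in a GGUF shard.
# Assumption: a recent gguf-py that includes GGML_TYPE_MXFP4 in its enum;
# an older gguf-py would fail on the same unknown type value.
from collections import Counter

from gguf import GGUFReader

MODEL = "/home/arc/llm/models/UD-Q8_K_XL/gpt-oss-120b-UD-Q8_K_XL-00001-of-00002.gguf"

reader = GGUFReader(MODEL)

# gpt-oss GGUFs are expected to report MXFP4 for the expert FFN weights,
# which the llama.cpp build bundled with ipex-llm maps to NONE.
counts = Counter(t.tensor_type.name for t in reader.tensors)
for type_name, n in sorted(counts.items()):
    print(f"{type_name:10s} {n} tensors")

If MXFP4 shows up in the output, the file itself is fine and the bundled llama.cpp is simply too old to read it.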

savvadesogle avatar Nov 05 '25 08:11 savvadesogle

Facing the same issue

hmfaysal avatar Nov 07 '25 08:11 hmfaysal

Hello SONG Ge @sgwhat, is there any information about when this might appear in the supported models?

savvadesogle avatar Nov 22 '25 17:11 savvadesogle

Not sure. I haven't been working on it recently.

sgwhat avatar Nov 24 '25 02:11 sgwhat

The fix was added in an August release of ollama, so I'm guessing we have to wait for the now seemingly defunct Intel team to roll the ollama fix into IPEX. I'm thinking this project is on the back burner, or they fired the talented staff that was working on it.

https://github.com/ollama/ollama/releases/tag/v0.11.2

kodiconnect avatar Dec 01 '25 21:12 kodiconnect

@kodiconnect unfortunately, yes. This is the end of ipex-llm. I get 60 t/s on a single A770 with llama.cpp (Vulkan), and flash attention (FA) is working as of build b7063.

Go to the OpenArc community; there is a SYCL maintainer there.
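
For anyone trying the same workaround, here is a minimal sketch (assumptions: an upstream llama.cpp Vulkan build of llama-server already running on the default 127.0.0.1:8080 with the gpt-oss model loaded) that sanity-checks the server's OpenAI-compatible endpoint from Python:

# Send one chat request to llama-server's OpenAI-compatible API and
# print the reply. Uses only the standard library.
import json
import urllib.request

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["choices"][0]["message"]["content"])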

savvadesogle avatar Dec 01 '25 21:12 savvadesogle