
[Bug] can't load DeepSeek 0528 version

Open AlbertG123 opened this issue 6 months ago • 23 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [ ] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

I updated to the latest ktransformers (June 1st) and built it with USE_BALANCE_SERVE=1 bash ./install.sh; the build succeeds. Qwen3-235B can be loaded and run successfully, and DeepSeek V3 can be loaded successfully.

But DeepSeek R1-0528 can't be loaded; it returns the error: invalid weight type

Reproduction

python ./ktransformers/local_chat.py --model_path ./DS0528-conf --gguf_path ./DS0528-UD-IQ1-S --cpuinfer 25 --max_new_tokens=3000 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml

Environment

Ubuntu 24.04; Intel Q870 + Core Ultra 9 285K; NVIDIA 4090D

AlbertG123 avatar Jun 04 '25 09:06 AlbertG123

Hi @AlbertG123, I think you are using the wrong inject YAML; this may need my colleague @Azure-Tang to take a look.

qiyuxinlin avatar Jun 05 '25 01:06 qiyuxinlin

Here is the step by step tutorial to run it: https://www.youtube.com/watch?v=Xui3_bA26LE and here is the written guide: https://github.com/Teachings/AIServerSetup/blob/main/06-DeepSeek-R1-0528/01-DeepSeek-R1-0528-KTransformers-Setup-Guide.md

Note: I have been unable to run it on 0.3.0 or 0.3.1, but it runs perfectly on 0.2.4.post1
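
If you need to pin that working release, something like this should work (a sketch assuming a source checkout; the tag name follows the project's usual release naming):

git checkout v0.2.4.post1
USE_BALANCE_SERVE=1 bash ./install.sh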

mtcl avatar Jun 05 '25 05:06 mtcl

Here is the step by step tutorial to run it: https://www.youtube.com/watch?v=Xui3_bA26LE and here is the written guide: https://github.com/Teachings/AIServerSetup/blob/main/06-DeepSeek-R1-0528/01-DeepSeek-R1-0528-KTransformers-Setup-Guide.md

Note: I have been unable to run it on 0.3.0 or 0.3.1, but it runs perfectly on 0.2.4.post1

Thank you very much. I have upgraded to the latest KT module, so we may need the KT team to pay attention to this issue.

AlbertG123 avatar Jun 05 '25 05:06 AlbertG123

Can you share CUDA version, nvcc and step by step on which commands you ran to build it? I can try to reproduce it and find a fix.

mtcl avatar Jun 05 '25 06:06 mtcl

Can you share CUDA version, nvcc and step by step on which commands you ran to build it? I can try to reproduce it and find a fix.

[AG] CUDA version is 12.8, NVIDIA driver 570.124.04.

  1. Completely followed the guide: https://kvcache-ai.github.io/ktransformers/en/install.html
  2. Ran USE_BALANCE_SERVE=1 bash ./install.sh
  3. Downloaded the DeepSeek-0528-UD-IQ1-S GGUF from https://huggingface.co/unsloth
  4. Downloaded the configuration files from https://huggingface.co/deepseek-ai/DeepSeek-R1
  5. Executed: python ./ktransformers/local_chat.py --model_path ./DS0528-conf --gguf_path ./DS0528-UD-IQ1-S --cpuinfer 25 --max_new_tokens=3000 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml

AlbertG123 avatar Jun 05 '25 06:06 AlbertG123

What command did you use for qwen3 to start the server?

mtcl avatar Jun 05 '25 06:06 mtcl

I copied this command: python ktransformers/server/main.py --architectures Qwen3MoeForCausalLM --model_path <model_dir> --gguf_path <gguf_dir> --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml --backend_type balance_serve

AlbertG123 avatar Jun 05 '25 06:06 AlbertG123

Here is the step by step tutorial to run it: https://www.youtube.com/watch?v=Xui3_bA26LE and here is the written guide: https://github.com/Teachings/AIServerSetup/blob/main/06-DeepSeek-R1-0528/01-DeepSeek-R1-0528-KTransformers-Setup-Guide.md

Note: I have been unable to run it on 0.3.0 or 0.3.1, but it runs perfectly on 0.2.4.post1

I've read your instructions, and it seems my RTX 3090 cannot run DS-R1-0528 :(

shinchou avatar Jun 05 '25 07:06 shinchou

Here is the step by step tutorial to run it: https://www.youtube.com/watch?v=Xui3_bA26LE and here is the written guide: https://github.com/Teachings/AIServerSetup/blob/main/06-DeepSeek-R1-0528/01-DeepSeek-R1-0528-KTransformers-Setup-Guide.md Note: I have been unable to run it on 0.3.0 or 0.3.1, but it runs perfectly on 0.2.4.post1

I've read your instructions, and it seems my RTX 3090 cannot run DS-R1-0528 :(

RTX 3090 should work. The 5090 does not work, but the 4090 works perfectly.

mtcl avatar Jun 06 '25 06:06 mtcl

@Azure-Tang can you help check? Thank you very much.

AlbertG123 avatar Jun 10 '25 02:06 AlbertG123

@Azure-Tang can you help check? Thank you very much.

Hi, I think you are using the fp8 YAML, which needs to load special weights.

To use IQ1_S weights, you need DeepSeek-V3-Chat-serve.yaml ~

If you want to use the hybrid fp8 mode for better performance, please check our fp8 tutorial.
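
For reference, the reporter's original command with the serve YAML swapped in would look like this (paths are from the original report):

python ./ktransformers/local_chat.py --model_path ./DS0528-conf --gguf_path ./DS0528-UD-IQ1-S --cpuinfer 25 --max_new_tokens=3000 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml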

Azure-Tang avatar Jun 10 '25 02:06 Azure-Tang

@Azure-Tang can you help check? Thank you very much.

Hi, I think you are using the fp8 YAML, which needs to load special weights.

To use IQ1_S weights, you need DeepSeek-V3-Chat-serve.yaml ~

If you want to use the hybrid fp8 mode for better performance, please check our fp8 tutorial.

It errors out as well: NotImplementedError: ggml_type 18 not implemented

AlbertG123 avatar Jun 16 '25 09:06 AlbertG123

@Azure-Tang can you help check? Thank you very much.

Hi, I think you are using the fp8 YAML, which needs to load special weights. To use IQ1_S weights, you need DeepSeek-V3-Chat-serve.yaml ~

If you want to use the hybrid fp8 mode for better performance, please check our fp8 tutorial.

It errors out as well: NotImplementedError: ggml_type 18 not implemented

Please paste your launch command.

Azure-Tang avatar Jun 26 '25 09:06 Azure-Tang

  1. Ktransformers --model_path ./DeepSeekR10528-conf --gguf_path ./DS-R1-0528-IQ1_S --port 10002 --web True --max_new_tokens=3000 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml
  2. The same command with /DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml

Neither 1 nor 2 works; both report ggml_type 18 not implemented.

AlbertG123 avatar Jul 03 '25 09:07 AlbertG123

@Azure-Tang The GGUF I'm using is not KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid. I would need the corresponding 0528 DeepSeek-R1-IQ1S-FP8 weights in the model-00000-of-00061.safetensors format. Could you provide a Q1S-FP8 model for R1-0528 and R1-T2?

AlbertG123 avatar Jul 07 '25 03:07 AlbertG123

@Azure-Tang The GGUF I'm using is not KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid. I would need the corresponding 0528 DeepSeek-R1-IQ1S-FP8 weights in the model-00000-of-00061.safetensors format. Could you provide a Q1S-FP8 model for R1-0528 and R1-T2?

So you want to run mixed-precision weights, i.e. fp8 on GPU and GGML IQ1 on CPU? In that case you need to build your own set of 0528 weights by following the tutorial.

Azure-Tang avatar Jul 07 '25 07:07 Azure-Tang

OK, I'll try it myself. Have you tried the recent R1-T2? Is there any performance improvement?

AlbertG123 avatar Jul 07 '25 07:07 AlbertG123

OK, I'll try it myself. Have you tried the recent R1-T2? Is there any performance improvement?

Not yet. I just looked at the Hugging Face repo; the authors say, "Unlike the original Chimera, which was based on the two parent models V3-0324 and R1, the new Chimera is a Tri-Mind with three parents, namely additionally R1-0528." As I understand it, its model architecture should match R1's, so no new adaptation should be needed. If you run into problems, feel free to open a new issue~

Azure-Tang avatar Jul 07 '25 07:07 Azure-Tang

So did you manage to fix the error with ggml_type 18?

TheLegendOfKitty avatar Jul 10 '25 05:07 TheLegendOfKitty

So did you manage to fix the error with ggml_type 18?

Hi, the error should not occur if you are using the correct YAML and IQ1_S weights.

Azure-Tang avatar Jul 10 '25 12:07 Azure-Tang

So did you manage to fix the error with ggml_type 18?

Hi, the error should not occur if you are using the correct YAML and IQ1_S weights.

And what's the correct yaml? Link to model I used: https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF

I'm using the 0.3.2-AVX512 container and this command:

python -m ktransformers.server.main --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --gguf_path /workspace/host/SSD/DeepSeek-R1-0528-UD-IQ1_S/ --model_path deepseek-ai/Deepseek-R1-0528 --backend_type balance_serve --use_cuda_graph --host 0.0.0.0 --port 8001

I get this error:

loading model.layers.0.self_attn.q_a_layernorm.weight to cuda:0
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/ktransformers/ktransformers/server/backend/interfaces/balance_serve.py", line 277, in run_engine
    engine = Engine(args, token_queue, broadcast_endpoint, kvcache_event)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ktransformers/ktransformers/server/backend/interfaces/balance_serve.py", line 181, in __init__
    optimize_and_load_gguf(self.model, optimize_config_path, gguf_path, config)
  File "/workspace/ktransformers/ktransformers/optimize/optimize.py", line 131, in optimize_and_load_gguf
    load_weights(module, weights_loader, device=default_device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 176, in load_weights
    module.load()
  File "/workspace/ktransformers/ktransformers/operators/base_operator.py", line 63, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 176, in load_weights
    module.load()
  File "/workspace/ktransformers/ktransformers/operators/base_operator.py", line 63, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 176, in load_weights
    module.load()
  File "/workspace/ktransformers/ktransformers/operators/linear.py", line 937, in load
    self.generate_linear.load(w=w)
  File "/workspace/ktransformers/ktransformers/operators/linear.py", line 622, in load
    w = self.load_weight(device=device)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ktransformers/ktransformers/operators/linear.py", line 118, in load_weight
    tensors = self.load_multi(key, ["weight"], device=device)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ktransformers/ktransformers/operators/linear.py", line 128, in load_multi
    tensors[k] = self.gguf_loader.load_gguf_tensor(key + "." + k, device=device)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ktransformers/ktransformers/util/custom_loader.py", line 431, in load_gguf_tensor
    raise NotImplementedError(f"ggml_type {ggml_type} not implemented")
NotImplementedError: ggml_type 18 not implemented

TheLegendOfKitty avatar Jul 10 '25 15:07 TheLegendOfKitty

So did you manage to fix the error with ggml_type 18?

Hi, the error should not occur if you are using the correct YAML and IQ1_S weights.

And what's the correct yaml? Link to model I used: https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF

I'm using the 0.3.2-AVX512 container and this command:

python -m ktransformers.server.main --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --gguf_path /workspace/host/SSD/DeepSeek-R1-0528-UD-IQ1_S/ --model_path deepseek-ai/Deepseek-R1-0528 --backend_type balance_serve --use_cuda_graph --host 0.0.0.0 --port 8001

I get this error: NotImplementedError: ggml_type 18 not implemented

Thanks for the update. I’ve checked the new unsloth/0528-iq1s weights — it appears they use a new quantization format, IQ3_S, which isn’t currently supported by KTransformers. Supporting IQ3_S would require implementing a new matrix multiplication operator, which can't be completed in a short timeframe. If possible, I recommend switching to q4km weights instead, as they are already supported.
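
To check which quantization types a downloaded GGUF actually contains before trying to load it, a quick scan along these lines can help (a sketch assuming the gguf package published with llama.cpp, pip install gguf; the shard filename is illustrative):

# Count the quantization types used by the tensors in a GGUF shard,
# to spot any ggml_type the KTransformers loader does not implement.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf")  # illustrative filename
counts = Counter(t.tensor_type.name for t in reader.tensors)
for quant_type, n in counts.most_common():
    print(f"{quant_type}: {n} tensors")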

Azure-Tang avatar Jul 11 '25 03:07 Azure-Tang

Thanks for the update. I’ve checked the new unsloth/0528-iq1s weights — it appears they use a new quantization format, IQ3_S, which isn’t currently supported by KTransformers. Supporting IQ3_S would require implementing a new matrix multiplication operator, which can't be completed in a short timeframe. If possible, I recommend switching to q4km weights instead, as they are already supported.

I'm experiencing similar errors with unsloth's DeepSeek-TNG-R1T2-Chimera-UD-IQ1_S quants (https://github.com/kvcache-ai/ktransformers/issues/1444); is this a related issue?

lunzima avatar Jul 16 '25 05:07 lunzima