[Bug] Can't load the DeepSeek 0528 version
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [ ] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
I updated to the latest ktransformers (June 1st) and built it with USE_BALANCE_SERVE=1 bash ./install.sh; the build succeeds. Qwen3-235B loads and runs successfully, and DeepSeek V3 also loads successfully.
However, DeepSeek R1-0528 cannot be loaded; it returns the error: invalid weight type.
Reproduction
python ./ktransformers/local_chat.py --model_path ./DS0528-conf --gguf_path ./DS0528-UD-IQ1-S --cpuinfer 25 --max_new_tokens=3000 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml
Environment
Ubuntu 24.04, Intel Q870 + Core Ultra 9 285K, NVIDIA RTX 4090D
Hi @AlbertG123, I think you are using the wrong inject YAML; my colleague @Azure-Tang may need to take a look.
Here is the step-by-step video tutorial to run it: https://www.youtube.com/watch?v=Xui3_bA26LE and here is the written guide: https://github.com/Teachings/AIServerSetup/blob/main/06-DeepSeek-R1-0528/01-DeepSeek-R1-0528-KTransformers-Setup-Guide.md
Note: I have been unable to run it on 0.3.0 or 0.3.1, but it runs perfectly on 0.2.4.post1.
Thank you very much. I have upgraded to the latest KT module, so the KT team may need to pay attention to this issue.
Can you share your CUDA version, nvcc version, and the exact commands you ran to build it? I can try to reproduce it and find a fix.
[AG] CUDA version is 12.8; NVIDIA driver: 570.124.04.
1. Completely followed the guidance at https://kvcache-ai.github.io/ktransformers/en/install.html
2. Ran USE_BALANCE_SERVE=1 bash ./install.sh
3. Downloaded the DeepSeek-0528-UD-IQ1_S GGUF from https://huggingface.co/unsloth
4. Downloaded the configuration files from https://huggingface.co/deepseek-ai/DeepSeek-R1
5. Then executed: python ./ktransformers/local_chat.py --model_path ./DS0528-conf --gguf_path ./DS0528-UD-IQ1-S --cpuinfer 25 --max_new_tokens=3000 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml
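For anyone reproducing this, here are the same steps as a single shell sketch. The huggingface-cli repo name and --include patterns are assumptions inferred from the links in this thread, not commands confirmed by the reporter:

```bash
# Build ktransformers with the balance_serve backend (as in the report)
git clone https://github.com/kvcache-ai/ktransformers.git && cd ktransformers
git submodule update --init --recursive
USE_BALANCE_SERVE=1 bash ./install.sh

# Fetch the UD-IQ1_S GGUF shards and the model config files
# (repo names and file patterns are assumptions; adjust to your setup)
huggingface-cli download unsloth/DeepSeek-R1-0528-GGUF --include "UD-IQ1_S/*" --local-dir ./DS0528-UD-IQ1-S
huggingface-cli download deepseek-ai/DeepSeek-R1 --include "*.json" --local-dir ./DS0528-conf

# Step 5: the command that triggers "invalid weight type"
python ./ktransformers/local_chat.py --model_path ./DS0528-conf --gguf_path ./DS0528-UD-IQ1-S \
  --cpuinfer 25 --max_new_tokens=3000 \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml
```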
What command did you use to start the server for Qwen3?
I copied this command: python ktransformers/server/main.py --architectures Qwen3MoeForCausalLM --model_path <model_dir> --gguf_path <gguf_dir> --optimize_config_path ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml --backend_type balance_serve
I've read your instructions, and it seems my RTX 3090 cannot run DS-R1-0528 :(
The RTX 3090 should work; the 5090 does not. The 4090 works perfectly though.
@Azure-Tang can you help check? Thank you very much.
Hi, I think you are using the fp8 YAML, which needs to load special weights.
To use IQ1_S weights, you need to use DeepSeek-V3-Chat-serve.yaml ~
If you want to use the hybrid fp8 mode for better performance, please check our fp8 tutorial.
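In case it helps, a minimal sketch of the corrected invocation, reusing the reporter's paths; the only change from the failing command is the optimize rule, and whether local_chat accepts the serve rule this way is an assumption based on the suggestion above:

```bash
# Same command as the report, but with the IQ1_S-compatible rule
# (DeepSeek-V3-Chat-serve.yaml) instead of the fp8 hybrid rule
python ./ktransformers/local_chat.py --model_path ./DS0528-conf --gguf_path ./DS0528-UD-IQ1-S \
  --cpuinfer 25 --max_new_tokens=3000 \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml
```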
Still getting an error: NotImplementedError: ggml_type 18 not implemented
Please paste your launch command.
- ktransformers --model_path ./DeepSeekR10528-conf --gguf_path ./DS-R1-0528-IQ1_S --port 10002 --web True --max_new_tokens=3000 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml
- the same command with optimize_rules/DeepSeek-V3-Chat-fp8-linear-ggml-experts.yaml

Neither of the two works; both report ggml_type 18 not implemented.
@Azure-Tang The GGUF I'm using is not KVCache-ai/DeepSeek-V3-GGML-FP8-Hybrid. I would need the corresponding 0528 DeepSeek-R1-IQ1S-FP8 weights in the model-00000-of-00061.safetensors format. Could you provide a Q1S-FP8 model for R1-0528 and R1-T2?
So you want to run mixed-precision weights, i.e. FP8 on the GPU and GGML IQ1 on the CPU? In that case you will need to build a set of 0528 weights yourself, following the tutorial.
OK, I'll try it myself. Have you tried the recent R1-T2? Is there a performance improvement?
Not yet. I just looked at the Hugging Face repo, and the author says: "Unlike the original Chimera, which was based on the two parent models V3-0324 and R1, the new Chimera is a Tri-Mind with three parents, namely additionally R1-0528." As I understand it, its model architecture should be identical to R1, so no new adaptation should be needed. If you run into problems, feel free to open a new issue~
So did you manage to fix the error with ggml_type 18?
Hi, the error should not occur if you are using the correct YAML and IQ1S weights.
And what's the correct YAML? Link to the model I used: https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF
I'm using the 0.3.2-AVX512 container and this command:

```bash
python -m ktransformers.server.main --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --gguf_path /workspace/host/SSD/DeepSeek-R1-0528-UD-IQ1_S/ --model_path deepseek-ai/Deepseek-R1-0528 --backend_type balance_serve --use_cuda_graph --host 0.0.0.0 --port 8001
```
I get this error:

```
loading model.layers.0.self_attn.q_a_layernorm.weight to cuda:0
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/ktransformers/ktransformers/server/backend/interfaces/balance_serve.py", line 277, in run_engine
    engine = Engine(args, token_queue, broadcast_endpoint, kvcache_event)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ktransformers/ktransformers/server/backend/interfaces/balance_serve.py", line 181, in __init__
    optimize_and_load_gguf(self.model, optimize_config_path, gguf_path, config)
  File "/workspace/ktransformers/ktransformers/optimize/optimize.py", line 131, in optimize_and_load_gguf
    load_weights(module, weights_loader, device=default_device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 176, in load_weights
    module.load()
  File "/workspace/ktransformers/ktransformers/operators/base_operator.py", line 63, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 176, in load_weights
    module.load()
  File "/workspace/ktransformers/ktransformers/operators/base_operator.py", line 63, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 174, in load_weights
    load_weights(child, gguf_loader, prefix+name+".", device=device)
  File "/workspace/ktransformers/ktransformers/util/utils.py", line 176, in load_weights
    module.load()
  File "/workspace/ktransformers/ktransformers/operators/linear.py", line 937, in load
    self.generate_linear.load(w=w)
  File "/workspace/ktransformers/ktransformers/operators/linear.py", line 622, in load
    w = self.load_weight(device=device)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ktransformers/ktransformers/operators/linear.py", line 118, in load_weight
    tensors = self.load_multi(key, ["weight"], device=device)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ktransformers/ktransformers/operators/linear.py", line 128, in load_multi
    tensors[k] = self.gguf_loader.load_gguf_tensor(key + "." + k, device=device)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ktransformers/ktransformers/util/custom_loader.py", line 431, in load_gguf_tensor
    raise NotImplementedError(f"ggml_type {ggml_type} not implemented")
NotImplementedError: ggml_type 18 not implemented
```
Thanks for the update. I've checked the new unsloth/0528-iq1s weights; it appears some tensors use a newer quantization format (ggml_type 18, i.e. IQ3_XXS in the GGUF type enum) which isn't currently supported by KTransformers. Supporting it would require implementing a new matrix multiplication operator, which can't be completed in a short timeframe. If possible, I recommend switching to Q4_K_M weights instead, as they are already supported.
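For anyone triaging similar files, here is a small diagnostic sketch, assuming the gguf Python package (pip install gguf) and an example shard filename, that lists the quantization types in a GGUF shard so unsupported ones can be spotted before loading:

```python
# List the ggml quantization types used in a GGUF shard, so unsupported
# types (e.g. ggml_type 18 / IQ3_XXS) can be spotted before loading.
from collections import Counter
from gguf import GGUFReader

# Filename is an example; point it at one of your downloaded shards.
reader = GGUFReader("DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf")
counts = Counter(t.tensor_type.name for t in reader.tensors)
for type_name, n in sorted(counts.items()):
    print(f"{type_name}: {n} tensors")
```

If the listing shows types like IQ3_XXS mixed into an otherwise IQ1_S quant, loading the file will hit the same NotImplementedError as above.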
I'm experiencing similar errors with unsloth's DeepSeek-TNG-R1T2-Chimera-UD-IQ1_S quants (https://github.com/kvcache-ai/ktransformers/issues/1444); is it a related issue?