
Add support for kimi-k2-0905 [Tracking]

Azure-Tang opened this issue 3 months ago • 5 comments

The newly released kimi-k2-0905 model is now supported by the ktransformers framework.

📌 We will update the supported-models list to include kimi-k2-0905.
🔄 We are also working on manual GGUF conversions and uploading them to Hugging Face.

Please watch this issue for the latest updates and links once the converted models are available.

How to run

Prepare model

The GGUF files are still being uploaded, but we also support BF16 safetensors. Please use the convert tool to obtain BF16 safetensors.

For more details, please refer to Kimi-K2.
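
As a rough sketch, the conversion step could look like the command below. The script name and flags are assumptions patterned after the FP8-to-BF16 helper used for DeepSeek-V3-style checkpoints (fp8_cast_bf16.py); the exact tool for Kimi-K2 may differ, so treat the Kimi-K2 guide above as authoritative.

# Assumed helper script and flags; paths are placeholders.
python fp8_cast_bf16.py \
  --input-fp8-hf-path /path/to/Kimi-K2-Instruct-0905 \
  --output-bf16-hf-path /path/to/Kimi-K2-Instruct-0905-bf16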

Run

python ktransformers/server/main.py \
  --port 10002 \
  --model_path <path_to_safetensor_config> \
  --gguf_path <path_to_gguf_files> \
  --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 32768 \
  --chunk_size 256 \
  --max_batch_size 4 \
  --backend_type balance_serve
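
Once the server is up, a quick smoke test against its OpenAI-style chat endpoint might look like the following. The port comes from the command above; the endpoint path and the model name in the payload are assumptions, so adjust them to your deployment.

# Assumed OpenAI-compatible endpoint; model name is a placeholder.
curl -s http://localhost:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Kimi-K2-Instruct-0905", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'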

Azure-Tang • Sep 05 '25

Are there any updates on the AMX int4 progress?

trilog-inc • Sep 08 '25

Are there any updates on the AMX int4 progress?

You may find it on the SOSP branch, which will be merged after it has been fully tested.

Azure-Tang • Sep 09 '25

I will take a look; looking forward to it!

trilog-inc • Sep 09 '25

Pulled the SOSP branch and ran some quick tests with DeepSeek-R1 using AMXInt4. Loading takes quite a while (about 3 hours) since it converts the quant from BF16 during loading, but the results are impressive. Long contexts (above roughly 32K) would fail with a CUDA out-of-memory error, but with smaller contexts I could reliably get prefill at ~46.0 T/s and decode at ~12.3 T/s. The prefill speed is an order of magnitude off from the published results, but the setups are different.

I will try a multi-GPU config in the next few days to see whether the CUDA OOM error can be avoided. The goal is to run V3.1, and maybe Kimi 0905 if the stability is there. The latest updates broke the cache for me, so I will have to look into that as well. The enable_thinking parameter logic is missing too, but that should be straightforward. It would also be great if we could load from a quantized model instead of BF16 to cut down loading time.
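
For reference, a quick way to check how much VRAM headroom is left before picking --cache_lens and the context length is a standard nvidia-smi query (nothing ktransformers-specific):

# Show per-GPU memory usage to estimate KV-cache headroom at long context.
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv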

All in all, I'm impressed and can't wait for the next updates!

Setup: W790 Sage, Xeon w7-3455, 512 GB DDR5 @ 4800 (MLC at ~240 GB/s), RTX 4090 (+ 3x 3090)

trilog-inc • Sep 11 '25

Little update on this. I ran AMXInt4 with multi-GPU and was able to load an even larger context. At 40K context, prefill speed got up to 141 T/s (very impressive!) and decode decreased to about 10.5 T/s (using 1x 4090 and 1x 3090). When I tried a 70K context, I hit the infamous "torch.AcceleratorError: CUDA error: an illegal memory access was encountered" error and the backend broke (https://github.com/kvcache-ai/ktransformers/issues/1417). Since it takes 3+ hours to load the model, I haven't dug deeper into this issue yet. I am using the ktransformers backend with a cache_lens of 100000, a max_new_tokens of 30000, and a chunk_size of 512.
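
One way to narrow down an illegal-memory-access error is to re-run with synchronous kernel launches so the traceback points at the failing kernel. CUDA_LAUNCH_BLOCKING=1 is the standard CUDA/PyTorch debug switch; the rest of the sketch below just mirrors the run command earlier in this thread plus the parameter values mentioned above, with placeholder paths.

# Synchronous launches make the reported stack trace point at the real failure site (slower, debug only).
CUDA_LAUNCH_BLOCKING=1 python ktransformers/server/main.py \
  --port 10002 \
  --model_path <path_to_safetensor_config> \
  --gguf_path <path_to_gguf_files> \
  --cache_lens 100000 \
  --max_new_tokens 30000 \
  --chunk_size 512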

Is there a known cause or solution for this issue that I could try to implement? Is there a known combination of parameters that causes this more frequently?

Again, thanks for all your hard work on this.

trilog-inc • Sep 22 '25