ktransformers icon indicating copy to clipboard operation
ktransformers copied to clipboard

[Bug] Kimi K2 Thinking only produces coherent output on first query, on small context

Open trilog-inc opened this issue 1 month ago • 2 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [x] 5. To help the community, I will use Chinese/English or attach an Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

Thanks for all the hard work! I am very excited to keep testing the new SGLAng integration

My system is the following:

Intel W7-3455 768 DDR5 3x3090 RTX + 1x4090 RTX

I successfully built and got the the inference server to run and return coherent output. However, when doing follow up queries, the response are gibberish and never contain the < think > tags. Happens with triton and flashinfer.

Query: Tell me a random fun fact about the English Empire 

Answer 1
<think>    The user is asking for a random fun fact about the "English Empire". I should note that "English Empire" is not a standard term - it's usually referred to as the "British Empire" (which included England, Scotland, Wales, and later Ireland). However, I'll interpret this as referring to the British Empire, as that's what people typically mean, and I should probably clarify this subtly. 

Let me think of some interesting, lesser-known facts about the British Empire: 

The Empire was so large that it was said "the sun never sets on the British Empire" - but that's well-known.
- The British Empire was the largest empire in history, covering about 25% of Earth's land surface.
- They introduced cricket and tea to many countries.
- The British obsession with tea led to the Opium Wars with China.
- The Empire had some strange laws and customs.
- There were some unusual territorial disputes.
- The empire had a significant impact on language.
         

Let me think of something more specific and "fun": 

- The British Empire once had a "Vice-Roy of India" - that's well known.
- The British took the Koh-i-Noor diamond from India.
- They had a prison colony in Australia.
- The British used to transport prisoners to colonies, but also had the practice of transporting convicts to America before the Revolution.
         

Here's a good one: The British Empire once declared war on the small country of Zanzibar, and the war lasted only 38 minutes - making it the shortest war in history. That's definitely a fun fact! 

Let me verify: The Anglo-Zanzibar War of 1896 lasted between 38 and 45 minutes (sources vary but 38 is commonly cited). This is indeed considered the shortest war in recorded history. 

This is perfect because: 

- It's factual and verifiable
- It's interesting and memorable
- It's somewhat obscure but well-documented
- It fits the "fun fact" criteria
         

I should present it in an engaging way, and I should also gently correct the "English Empire" to "British Empire" without being pedantic. I'll make it clear and interesting. 

I could also mention other facts like: 

- The British Empire was responsible for the phrase "white elephant" coming into common usage
- The British once had a "British Antarctic Territory"
- The empire's postal system was incredibly efficient
         

- But the Zanzibar war is definitely the most fun and memorable. I'll go with that one. 
  
</think>
When the pro-British Sultan of Zanzibar died, his successor, Khalid bin Barghash, took power without British approval. The British gave him an ultimatum to stand down. When he refused, British warships opened fire on his palace at 9:00 AM. By 9:38 AM, the Sultan's forces had surrendered, his palace was in ruins, and he had fled to the German consulate. The British then installed a more favorable ruler.

Query 2: What is a good british meal? 

Answer 2: 

Candlelight was Active Creation Myth: Invention Myth: A Course in Creation Myth: A Course in Creation: Up: A Course in Creation: The: A Course in the Course: A Course in Course A Course in Course: A Course in Course: The Course: A Course in The Course: A Course in Course: The Course: A Course in the Course: A Course: The Course: A Course in the Course: A Course in The Course: A Course in the Course: A Course in the Course: A Course: The Course: A Course in The Course: A Course in the Course: A Course in the Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in the Course: A Course in The Course: A Course in the Course: A Course in The Course: A Course in The Course: The Course: A Course in The Course: A Course in The Course: The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A Course in The Course: A

Temperature: 1.0 top_k: 50 top_p: 1.0

Another case, using a derivative of the Thireus DipiloBlop prompt:

Thireus DipiloBlop Prompt.md

Output:

 on the site.
Jk the Next.
Now, we will move to the next.

 使用本网站(提供的)邮件服务,构建一个邮件应用。 发布者: 老猫(the future),出处:the future(未来)。  
  2014-11-04 14:55:30 发布者: the future(未来)

“该邮件为创建邮件发布者(用户)使用。该用户为创建邮件发布者(用户)。”

(The user is a user of the user, and the user is the user of the user).

This is the same as that of the user. This is the same as that of the user. This is the same as that of the user. This is the same as that of the user.
The user of the user is the same as that of the user. The user of the user is the same as that of the user. The user of the user is the same as that of the user.

The user of the user is the same as that of the user. The user of the user is the same as that of the user.

The user of the user is the same as that of the user. The user of the user is the same as that of the user.

Any idea what this could be? I will be testing with just 2 RTX 3090 to validate if the mismatched 4th GPU ( 4090 ) is causing issue. Next tests will also be with python 3.11 to see if that makes a difference too. i will be updating this thread with my findings.

Reproduction

python -m sglang.launch_server --host 0.0.0.0 --port 60000 --model /mnt/storage/models/kimik2think/ --kt-amx-weight-path /mnt/storage/models/kimik2thinkcpu/ --kt-cpuinfer 20 --kt-threadpool-count 2 --kt-num-gpu-experts 4 --kt-amx-method AMXINT4 --attention-backend flashinfer --trust-remote-code --mem-fraction-static 0.98 --chunked-prefill-size 4096 --max-running-requests 1 --max-total-tokens 32768 --enable-mixed-chunk --tensor-parallel-size 4 --enable-p2p-check --disable-shared-experts-fusion --served-model-name kimi

Environment

Driver Version: 570.124.06 CUDA Version: 12.8 python3.12 torch 2.8.0+cu128

trilog-inc avatar Nov 10 '25 03:11 trilog-inc

Here we go again. :D

I said it many times. And I will repeat it. The ktransformers rely on flashinfer. flashinfer is unstable. Hence the ktransformers are too. That has been the state of affairs since DeepSeek-R1. Nothing has changed and nothing was addressed. It just doesn't work.

magikRUKKOLA avatar Nov 10 '25 22:11 magikRUKKOLA

python -m sglang.launch_server

But how that reproduction relates to the ktransformers? sglang uses flashinfer too? Oh. Yes, it does. lol . Good luck with that.

magikRUKKOLA avatar Nov 10 '25 22:11 magikRUKKOLA

Updating and following the latest recommendations for install. All works! Kimi-K2-Thinking is prefilling at ~270t/s and decoding at 9.8t/s.

the --kt-threadpool-count command would only work a setting of 2 ( 1 would output gibberish ). I enabled the virtual NUMA of my system to 2 and everything seems to be working well.

trilog-inc avatar Nov 18 '25 03:11 trilog-inc