[Bug] Kimi K2 Thinking produces invalid tokens during decode when using kt-kernel / kt-amx weights with CPU experts
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
After installing kt-kernel and launching with the command given in the README, the server starts up and responds to requests, but the output from K2-Thinking is clearly broken: it just repeats tokens over and over.
I tried changing a few sizing options, and while the server does successfully "decode" tokens, the tokens it produces don't seem to be right at all.
I'm not sure whether this is an sglang bug or a ktransformers bug, though.
Reproduction
$ python -m sglang.launch_server \
    --model moonshotai/Kimi-K2-Thinking \
    --kt-amx-weight-path /path/to/home/.cache/huggingface/hub/models--KVCache-ai--Kimi-K2-Thinking-CPU-weight/snapshots/690ffacb9203d3b5e05ee8167ff1f5d4ae027c83 \
    --kt-cpuinfer 128 \
    --kt-threadpool-count 1 \
    --kt-num-gpu-experts 200 \
    --kt-amx-method AMXINT4 \
    --trust-remote-code \
    --mem-fraction-static 0.98 \
    --chunked-prefill-size 4096 \
    --max-running-requests 10 \
    --max-total-tokens 37000 \
    --enable-mixed-chunk \
    --tensor-parallel-size 4 \
    --enable-p2p-check \
    --disable-shared-experts-fusion
To test, I ran:
$ curl -N http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -H "Accept: text/event-stream" -d '{
"model": "moonshotai/Kimi-K2-Thinking",
"messages": [{"role": "user", "content": "Reply with exactly: hello"}],
"temperature": 0,
"stream": true
}'
Result (note the output is just repeated tokens, "assistant assistant assistant", etc.):
data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":"assistant","content":"","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}
data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":null,"content":" actually","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}
data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":null,"content":" assistant","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}
data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":null,"content":" assistant","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}
data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":null,"content":" assistant","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}
data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":null,"content":" t","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}
...
Environment
OS: Ubuntu
GPU: 4x RTX Pro 6000 Blackwell
CPU: AMD Ryzen Threadripper PRO 9995WX (96 cores)
RAM: 1 TB
I installed kt-kernel and sglang in the same virtualenv:
uv venv
source .venv/bin/activate
uv pip install "sglang[all]"
uv pip install git+https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel
I also installed ktransformers itself:
export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=100 -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc"
uv pip install git+https://github.com/kvcache-ai/ktransformers.git
For the external dependencies, I installed:
sudo apt-get install -y libblis-openmp-dev libhwloc-dev libnuma-dev pkg-config
Nvidia information:
NVIDIA-SMI 580.95.05 Driver Version: 580.95.05
Try --kt-threadpool-count 2 instead of --kt-threadpool-count 1. It fixed the issue for me on a single Epyc with 4x 6000 Pros.
I did, haha. The command above includes --kt-threadpool-count 1. My issue wasn't NUMA; it was that the output tokens were incoherent.
When you got it running with the single Epyc and 4x 6000 Pros, did you get valid tokens out? I was able to get ~20 tokens per second, but it was just gibberish.
Yes, I've got it tuned to about 36 tokens/sec and it's looking really good. Mind you, it is only coherent with --kt-threadpool-count 2. At 1 it spews garbage, like you said.
Also, --ep doesn't work; it produces garbage with that, too.
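For reference, the specific change being suggested in this thread is the threadpool flag. As a sketch only, this is the reproduction command from the top of the issue with --kt-threadpool-count changed to 2; the commenter's exact tuned configuration isn't given, so every other flag and the weight path are simply carried over unchanged from the original report:
$ python -m sglang.launch_server \
    --model moonshotai/Kimi-K2-Thinking \
    --kt-amx-weight-path /path/to/home/.cache/huggingface/hub/models--KVCache-ai--Kimi-K2-Thinking-CPU-weight/snapshots/690ffacb9203d3b5e05ee8167ff1f5d4ae027c83 \
    --kt-cpuinfer 128 \
    --kt-threadpool-count 2 \
    --kt-num-gpu-experts 200 \
    --kt-amx-method AMXINT4 \
    --trust-remote-code \
    --mem-fraction-static 0.98 \
    --chunked-prefill-size 4096 \
    --max-running-requests 10 \
    --max-total-tokens 37000 \
    --enable-mixed-chunk \
    --tensor-parallel-size 4 \
    --enable-p2p-check \
    --disable-shared-experts-fusion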
Coming back to this a few days later: my guess is that using an AMD processor is the main issue, since AMX (as in AMXINT4) is an Intel-specific instruction set (?). That said, even with --kt-amx-method AMXINT4 turned off, it still doesn't work.
It's clearly executing some fallback path successfully (it isn't hard-crashing), but the fallback seems to be reading or parsing the weights incorrectly, so it generates gibberish.
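One quick way to test the AMX hypothesis is to check the CPU feature flags. AMX is currently an Intel-only extension (Sapphire Rapids and newer Xeons), so on a Threadripper the generic Linux check below should return nothing; it is not specific to kt-kernel in any way:
$ grep -o 'amx[_a-z0-9]*' /proc/cpuinfo | sort -u
# On an AMX-capable Intel CPU this prints amx_bf16, amx_int8, amx_tile;
# no output means the CPU has no AMX support, so AMXINT4 cannot run natively.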