
[Bug] Kimi K2 Thinking decode produces invalid tokens when using kt-kernel / kt-amx weights with cpu experts

Status: Open · bluecoconut opened this issue 1 month ago · 4 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

After installing kt-kernel and launching with the command given in the README, the server starts up and responds to requests, but the content generated by K2-Thinking is clearly broken: it just repeats tokens over and over.

I tried changing a few sizing options, and while it is successfully "decoding", the tokens it decodes don't seem to be right at all.

Not sure if this is an sglang bug or a ktransformers bug, though.

Reproduction

$ python -m sglang.launch_server \
    --model moonshotai/Kimi-K2-Thinking \
    --kt-amx-weight-path /path/to/home/.cache/huggingface/hub/models--KVCache-ai--Kimi-K2-Thinking-CPU-weight/snapshots/690ffacb9203d3b5e05ee8167ff1f5d4ae027c83 \
    --kt-cpuinfer 128 \
    --kt-threadpool-count 1 \
    --kt-num-gpu-experts 200 \
    --kt-amx-method AMXINT4 \
    --trust-remote-code \
    --mem-fraction-static 0.98 \
    --chunked-prefill-size 4096 \
    --max-running-requests 10 \
    --max-total-tokens 37000 \
    --enable-mixed-chunk \
    --tensor-parallel-size 4 \
    --enable-p2p-check \
    --disable-shared-experts-fusion

To test, ran:

$ curl -N http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Accept: text/event-stream" \
    -d '{
      "model": "moonshotai/Kimi-K2-Thinking",
      "messages": [{"role": "user", "content": "Reply with exactly: hello"}],
      "temperature": 0,
      "stream": true
    }'

Result (note that the output just repeats tokens, "assistant assistant assistant" etc.):

data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":"assistant","content":"","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}

data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":null,"content":" actually","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}

data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":null,"content":" assistant","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}

data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":null,"content":" assistant","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}

data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":null,"content":" assistant","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}

data: {"id":"40204d5b7303457783dcb39f3e136ec3","object":"chat.completion.chunk","created":1762466824,"model":"moonshotai/Kimi-K2-Thinking","choices":[{"index":0,"delta":{"role":null,"content":" t","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}],"usage":null}
...
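
For readability, the streamed deltas can be concatenated with a small pipeline (a sketch, assuming jq is installed; on a working server this should print something close to "hello", but here it prints the gibberish above run together):

$ curl -sN http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "moonshotai/Kimi-K2-Thinking",
         "messages": [{"role": "user", "content": "Reply with exactly: hello"}],
         "temperature": 0, "stream": true}' \
  | sed -un 's/^data: //p' \
  | grep --line-buffered -v '^\[DONE\]' \
  | jq -rj '.choices[0].delta.content // empty'; echo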

Environment

OS: Ubuntu
GPU: 4x RTX Pro 6000 Blackwell
CPU: AMD Ryzen Threadripper PRO 9995WX (96 cores)
RAM: 1 TB

I installed kt-kernel and sglang in the same virtualenv:

uv venv
source .venv/bin/activate
uv pip install "sglang[all]"
uv pip install git+https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel

I also installed ktransformers itself:

export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=100 -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc" 
uv pip install git+https://github.com/kvcache-ai/ktransformers.git

Also, for external dependencies I installed:

sudo apt-get install -y libblis-openmp-dev libhwloc-dev libnuma-dev pkg-config 
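
A quick optional sanity check (not part of the original report) that those libraries are visible to the dynamic linker:

$ ldconfig -p | grep -E 'blis|hwloc|numa'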

Nvidia information:

NVIDIA-SMI 580.95.05              Driver Version: 580.95.05    

bluecoconut · Nov 06 '25

Try --kt-threadpool-count 2 instead of --kt-threadpool-count 1. It fixed the issue for me on a single Epyc with 4x 6000 Pros.
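
That is, rerun the exact launch command from the report with a single flag changed:

  --kt-threadpool-count 1   ->   --kt-threadpool-count 2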

0xhaggis · Nov 10 '25

I did, haha. The command above includes --kt-threadpool-count 1. My issue wasn't NUMA; it was that the output tokens were incoherent.

When you got it running with the single Epyc and 4x 6000 Pros, did you get valid tokens out? I was able to get ~20 tokens per second, but it was just gibberish.

bluecoconut · Nov 10 '25

Yes, I've got it tuned to about 36 tokens/sec and it's looking really good. Mind you, it is only coherent with --kt-threadpool-count 2. At 1 it spews garbage, like you said.

Also, --ep doesn't work; it writes garbage with that, too.

0xhaggis · Nov 10 '25

Coming back to this a few days later: my guess is that using an AMD processor is the main issue, since AMXINT4 relies on AMX, an Intel-specific instruction set (?). That said, even after turning off --kt-amx-method AMXINT4 it still doesn't work. It's clearly executing some fallback successfully (it's not hard-crashing), but the fallback is reading the weights in some invalid way / parsing them incorrectly, so it generates gibberish.
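
For anyone checking the same hypothesis: AMX support shows up in the kernel's CPU flags (amx_bf16, amx_int8, amx_tile are the names Linux reports on AMX-capable Xeons), so a quick test is:

$ grep -o 'amx[a-z_0-9]*' /proc/cpuinfo | sort -u

On Intel Sapphire Rapids or later this lists amx_bf16, amx_int8 and amx_tile; on this Threadripper it should print nothing, since AMD CPUs have no AMX.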

bluecoconut · Nov 11 '25