Chen Hongtao


@PhzCode Could you provide the following information: 1. Which CPU backend was used in the failing case, AMX or llamafile? 2. What was the prompt of the request that triggered the error, or how many tokens long was it?

The issue is that you used `--kt-num-gpu-experts 16`, which keeps 16 experts per layer on the GPU. 24 GB of VRAM can't hold that many, so try lowering this parameter.

I see. The native Kimi-K2-Thinking model keeps its non-expert weights in BF16 on the GPU side, so it consumes more VRAM than DeepSeek-V3/R1. A 24 GB GPU isn't sufficient. You may consider...

We provide the [scripts](https://github.com/kvcache-ai/ktransformers/blob/main/kt-kernel/scripts) to perform quantization, but by default they quantize the expert weights and leave the non-expert weights untouched, whereas in your case only the non-expert weights need quantizing. I guess...

This part of the code has not yet been merged into the main branch; the merge is in progress. In the meantime, you can refer to the [sosp25-ae branch](https://github.com/kvcache-ai/ktransformers/tree/sosp25-ae/sosp25-ae).

KTransformers has been refactored, and the YAML-based flexible injection framework is now deprecated. The inference code now lives in [kt-kernel](https://github.com/kvcache-ai/ktransformers/tree/main/kt-kernel), and the recommended way to run it is through SGLang. When launching SGLang...