kt-kernel 0.4.2 Kimi-K2 Thinking: torch.OutOfMemoryError: CUDA out of memory.
### Reminder
- [x] I have read the above rules and searched the existing issues.
### System Info
Ubuntu 24.04.3, kt-kernel 0.4.2, Kimi-K2 Thinking
Dual Xeon 6416H CPUs, 1 TB RAM, RTX 3090 24 GB
### Reproduction
```
python -m sglang.launch_server --host 0.0.0.0 --port 60000 \
  --model /opt/ai-models/Kimi-K2-Thinking/ \
  --kt-weight-path /opt/ai-models/Kimi-K2-Instruct-CPU-weight/ \
  --kt-cpuinfer 36 --kt-threadpool-count 2 --kt-num-gpu-experts 16 \
  --kt-method AMXINT4 --attention-backend flashinfer --trust-remote-code \
  --mem-fraction-static 0.98 --chunked-prefill-size 4096 \
  --max-running-requests 37 --max-total-tokens 3700 --enable-mixed-chunk \
  --tensor-parallel-size 1 --enable-p2p-check --disable-shared-experts-fusion
```
```
WorkerPool[0x2a5a6f30] 2 subpools, [numa:threads][0:18] [1:18]
===========In NumaPool============
In Numa Worker Pool at NUMA 0, 18 threads
===========In NumaPool============
In Numa Worker Pool at NUMA 1, 18 threads
[2025-11-28 11:19:49] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/aigao/sglang/python/sglang/srt/managers/scheduler.py", line 2630, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/home/aigao/sglang/python/sglang/srt/managers/scheduler.py", line 309, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/home/aigao/sglang/python/sglang/srt/managers/tp_worker.py", line 237, in __init__
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/home/aigao/sglang/python/sglang/srt/model_executor/model_runner.py", line 338, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/aigao/sglang/python/sglang/srt/model_executor/model_runner.py", line 442, in initialize
    self.load_model()
  File "/home/aigao/sglang/python/sglang/srt/model_executor/model_runner.py", line 806, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/home/aigao/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/home/aigao/sglang/python/sglang/srt/model_loader/loader.py", line 594, in load_model
    model = _initialize_model(
            ^^^^^^^^^^^^^^^^^^
  File "/home/aigao/sglang/python/sglang/srt/model_loader/loader.py", line 262, in _initialize_model
    return model_class(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/aigao/sglang/python/sglang/srt/models/deepseek_v2.py", line 3394, in __init__
    self.model = DeepseekV2Model(
                 ^^^^^^^^^^^^^^^^
  File "/home/aigao/sglang/python/sglang/srt/models/deepseek_v2.py", line 3172, in __init__
    self.model = DeepseekV2Model(
  File "/home/aigao/sglang/python/sglang/srt/utils/common.py", line 591, in make_layers
    + get_offloader().wrap_modules(
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aigao/sglang/python/sglang/srt/utils/offloader.py", line 36, in wrap_modules
    return list(all_modules_generator)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aigao/sglang/python/sglang/srt/utils/common.py", line 593, in
```
### Others
Can kt-kernel 0.4.2 work with 24 GB of VRAM, like kt 0.3.2 did?
If not, does that mean a single 3090 card can only use kt 0.3.2?
The issue is that you used --kt-num-gpu-experts 16, which specifies that each layer has 16 experts on the GPU. The 24GB VRAM can’t handle that, so try lowering this parameter.
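For reference, a lower-VRAM variant of the launch command from the reproduction section might look like the sketch below. Only --kt-num-gpu-experts and --mem-fraction-static differ from the original command; the values are illustrative, not tested on this exact machine:

```
# Same command as in the reproduction section, but with fewer experts pinned
# to the GPU and a slightly lower static memory fraction (illustrative values).
python -m sglang.launch_server --host 0.0.0.0 --port 60000 \
  --model /opt/ai-models/Kimi-K2-Thinking/ \
  --kt-weight-path /opt/ai-models/Kimi-K2-Instruct-CPU-weight/ \
  --kt-cpuinfer 36 --kt-threadpool-count 2 \
  --kt-num-gpu-experts 4 \
  --kt-method AMXINT4 --attention-backend flashinfer --trust-remote-code \
  --mem-fraction-static 0.95 --chunked-prefill-size 4096 \
  --max-running-requests 37 --max-total-tokens 3700 --enable-mixed-chunk \
  --tensor-parallel-size 1 --enable-p2p-check --disable-shared-experts-fusion
```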
> The issue is that you used --kt-num-gpu-experts 16, which specifies that each layer has 16 experts on the GPU. The 24GB VRAM can’t handle that, so try lowering this parameter.
I tried --kt-num-gpu-experts with values from 1 to 4 and the crash still happens. Has anyone with 24 GB VRAM and 1 TB RAM managed to start Kimi-K2 on kt 0.4.2, and if so, how?
I see. The native Kimi-K2-Thinking model uses BF16-precision (non-expert) weights on the GPU side, so it consumes more VRAM than DeepSeek-V3/R1. A 24 GB GPU isn’t sufficient. You may consider using a quantized version.
You need at least 27-28 GB of VRAM, even with --kt-num-gpu-experts 0.
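Rough arithmetic behind that figure (an estimate, not a measured breakdown): at BF16, 2 bytes per parameter, 27-28 GB corresponds to roughly 13-14B non-expert parameters resident on the GPU, so a 24 GB card falls short even with zero experts on the GPU and before any KV cache or activation memory is allocated.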
> I see. The native Kimi-K2-Thinking model uses BF16-precision (non-expert) weights on the GPU side, so it consumes more VRAM than DeepSeek-V3/R1. A 24 GB GPU isn’t sufficient. You may consider using a quantized version.
Does kt 0.4.2 support a Q4-quantized Kimi-K2 Thinking model?
We provide scripts to perform quantization, but by default they quantize the expert weights, not the non-expert weights. In your case it is the non-expert weights that need quantizing. You may be able to customize which weights get quantized by adjusting the ignore_patterns in convert_gpu_weights.py.
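To illustrate what such a filter typically does (a minimal sketch only; the actual ignore_patterns handling in convert_gpu_weights.py may be structured differently), the idea is to skip quantization for any weight whose name matches an ignore pattern:

```python
# Minimal sketch of pattern-based weight filtering. The pattern strings and
# the should_quantize() helper are hypothetical; check convert_gpu_weights.py
# for the actual names and pattern syntax it uses.
import fnmatch

# Weights whose names match an ignore pattern keep their original precision;
# everything else gets quantized. Changing these patterns changes which side
# (expert vs. non-expert weights) keeps full precision.
ignore_patterns = ["*.mlp.experts.*"]  # example: leave expert weights untouched

def should_quantize(weight_name: str) -> bool:
    """Quantize a tensor unless its name matches an ignore pattern."""
    return not any(fnmatch.fnmatch(weight_name, p) for p in ignore_patterns)

for name in [
    "model.layers.0.self_attn.q_proj.weight",           # non-expert -> quantize
    "model.layers.3.mlp.experts.17.down_proj.weight",   # expert -> keep
]:
    print(name, "->", "quantize" if should_quantize(name) else "keep")
```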
Is that right? @ovowei
You just need one more 24 GB 3090.
I run this model with 2x 4090s without any problems.
I recommend having at least 48 GB of VRAM in order to get a decent context length.