
kt-kernel 0.4.2 Kimi-K2 Thinking: torch.OutOfMemoryError: CUDA out of memory.

Open mrgaolei opened this issue 1 month ago • 7 comments

Reminder

- [x] I have read the above rules and searched the existing issues.

System Info

Ubuntu 24.04.3, kt-kernel 0.4.2, Kimi-K2 Thinking

Dual Intel Xeon 6416H CPUs, 1 TB RAM, RTX 3090 (24 GB VRAM)

Reproduction

```
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 60000 \
  --model /opt/ai-models/Kimi-K2-Thinking/ \
  --kt-weight-path /opt/ai-models/Kimi-K2-Instruct-CPU-weight/ \
  --kt-cpuinfer 36 \
  --kt-threadpool-count 2 \
  --kt-num-gpu-experts 16 \
  --kt-method AMXINT4 \
  --attention-backend flashinfer \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 37 \
  --max-total-tokens 3700 \
  --enable-mixed-chunk \
  --tensor-parallel-size 1 \
  --enable-p2p-check \
  --disable-shared-experts-fusion
```

```
WorkerPool[0x2a5a6f30] 2 subpools, [numa:threads][0:18] [1:18]
===========In NumaPool============
In Numa Worker Pool at NUMA 0, 18 threads
===========In NumaPool============
In Numa Worker Pool at NUMA 1, 18 threads
[2025-11-28 11:19:49] Scheduler hit an exception:
Traceback (most recent call last):
  File "/home/aigao/sglang/python/sglang/srt/managers/scheduler.py", line 2630, in run_scheduler_process
    scheduler = Scheduler(
  File "/home/aigao/sglang/python/sglang/srt/managers/scheduler.py", line 309, in __init__
    self.tp_worker = TpModelWorker(
  File "/home/aigao/sglang/python/sglang/srt/managers/tp_worker.py", line 237, in __init__
    self._model_runner = ModelRunner(
  File "/home/aigao/sglang/python/sglang/srt/model_executor/model_runner.py", line 338, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/aigao/sglang/python/sglang/srt/model_executor/model_runner.py", line 442, in initialize
    self.load_model()
  File "/home/aigao/sglang/python/sglang/srt/model_executor/model_runner.py", line 806, in load_model
    self.model = get_model(
  File "/home/aigao/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
    return loader.load_model(
  File "/home/aigao/sglang/python/sglang/srt/model_loader/loader.py", line 594, in load_model
    model = _initialize_model(
  File "/home/aigao/sglang/python/sglang/srt/model_loader/loader.py", line 262, in _initialize_model
    return model_class(**kwargs)
  File "/home/aigao/sglang/python/sglang/srt/models/deepseek_v2.py", line 3394, in __init__
    self.model = DeepseekV2Model(
  File "/home/aigao/sglang/python/sglang/srt/models/deepseek_v2.py", line 3172, in __init__
  File "/home/aigao/sglang/python/sglang/srt/utils/common.py", line 591, in make_layers
    + get_offloader().wrap_modules(
  File "/home/aigao/sglang/python/sglang/srt/utils/offloader.py", line 36, in wrap_modules
    return list(all_modules_generator)
  File "/home/aigao/sglang/python/sglang/srt/utils/common.py", line 593, in <genexpr>
    layer_fn(idx=idx, prefix=add_prefix(idx, prefix))
  File "/home/aigao/sglang/python/sglang/srt/models/deepseek_v2.py", line 3174, in <lambda>
    lambda idx, prefix: DeepseekV2DecoderLayer(
  File "/home/aigao/sglang/python/sglang/srt/models/deepseek_v2.py", line 2895, in __init__
    self.self_attn = DeepseekV2AttentionMLA(
  File "/home/aigao/sglang/python/sglang/srt/models/deepseek_v2.py", line 1305, in __init__
    self.o_proj = RowParallelLinear(
  File "/home/aigao/sglang/python/sglang/srt/layers/linear.py", line 1255, in __init__
    self.quant_method.create_weights(
  File "/home/aigao/sglang/python/sglang/srt/layers/quantization/unquant.py", line 108, in create_weights
    torch.empty(
  File "/home/aigao/anaconda3/envs/kt-kernel/lib/python3.11/site-packages/torch/utils/_device.py", line 103, in __torch_function__
    return func(*args, **kwargs)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 23.56 GiB of which 109.50 MiB is free. Process 4308 has 360.00 MiB memory in use. Including non-PyTorch memory, this process has 23.07 GiB memory in use. Of the allocated memory 22.72 GiB is allocated …
```


Others

Can kt-kernel 0.4.2 run with 24 GB of VRAM, like kt 0.3.2 can?
If not, does that mean a machine with a single 3090 card can only use kt 0.3.2?

mrgaolei avatar Nov 28 '25 03:11 mrgaolei

The issue is that you used --kt-num-gpu-experts 16, which specifies that each layer has 16 experts on the GPU. The 24GB VRAM can’t handle that, so try lowering this parameter.

chenht2022 avatar Nov 28 '25 06:11 chenht2022
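
For a rough sense of scale, here is a back-of-envelope sketch of what GPU-resident experts cost. The config numbers (hidden size 7168, MoE intermediate size 2048, about 60 MoE layers) follow Kimi-K2's DeepSeek-V3-style architecture, and BF16 storage for the GPU experts is an assumption here, not something confirmed by kt-kernel:

```python
# Back-of-envelope VRAM estimate for GPU-resident experts.
# Assumptions (not taken from kt-kernel): Kimi-K2's DeepSeek-V3-style
# config (hidden=7168, moe_intermediate=2048, ~60 MoE layers) and
# BF16 expert weights on the GPU.

HIDDEN = 7168
MOE_INTERMEDIATE = 2048
MOE_LAYERS = 60
BYTES_PER_PARAM = 2        # BF16; roughly quarter this for an INT4 layout
MATS_PER_EXPERT = 3        # gate_proj, up_proj, down_proj

per_expert = MATS_PER_EXPERT * HIDDEN * MOE_INTERMEDIATE * BYTES_PER_PARAM

for n in (16, 8, 4, 1):
    total_gib = per_expert * n * MOE_LAYERS / 2**30
    print(f"--kt-num-gpu-experts {n}: ~{total_gib:.1f} GiB for experts alone")
```

Even if the GPU experts were stored at INT4 rather than BF16, 16 per layer would still be on the order of 20 GiB before counting the non-expert weights, activations, and KV cache.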

> The issue is that you used --kt-num-gpu-experts 16, which specifies that each layer has 16 experts on the GPU. The 24GB VRAM can’t handle that, so try lowering this parameter.

I tried:

--kt-num-gpu-experts [1-4]

and the same crash still happens. Has anyone with 24 GB VRAM and 1 TB RAM gotten Kimi-K2 running on kt 0.4.2, and if so, how did you start it?

mrgaolei avatar Nov 29 '25 05:11 mrgaolei

I see. The native Kimi-K2-Thinking model uses BF16-precision (non-expert) weights on the GPU side, so it consumes more VRAM than DeepSeek-V3/R1. A 24 GB GPU isn’t sufficient. You may consider using a quantized version.

chenht2022 avatar Dec 01 '25 11:12 chenht2022

You need at least 27-28 GB of VRAM, even with --kt-num-gpu-experts 0.

shrould8888 avatar Dec 01 '25 21:12 shrould8888
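
Working backwards from that figure: roughly 27-28 GB of BF16 weights implies about 14B non-expert parameters, which also shows why a quantized version could fit. A quick sketch, with the parameter count inferred from the number in the comment above rather than read from the model config:

```python
# Infer the non-expert parameter count from the ~27-28 GB BF16 figure
# reported above, then estimate the footprint at lower precisions.
# The 27.5 GiB midpoint comes from the comment, not from a measurement.

BF16_BYTES, INT8_BYTES, INT4_BYTES = 2.0, 1.0, 0.5

nonexpert_bytes = 27.5 * 2**30
nonexpert_params = nonexpert_bytes / BF16_BYTES
print(f"implied non-expert params: ~{nonexpert_params / 1e9:.1f}B")

for name, b in (("BF16", BF16_BYTES), ("INT8", INT8_BYTES), ("INT4", INT4_BYTES)):
    print(f"{name}: ~{nonexpert_params * b / 2**30:.1f} GiB")
```

At INT4 the same weights would take roughly 7 GiB, leaving plenty of a 24 GB card for the GPU experts and KV cache.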

> I see. The native Kimi-K2-Thinking model uses BF16-precision (non-expert) weights on the GPU side, so it consumes more VRAM than DeepSeek-V3/R1. A 24 GB GPU isn’t sufficient. You may consider using a quantized version.

Does kt 0.4.2 support a Q4-quantized Kimi-K2-Thinking model?

mrgaolei avatar Dec 02 '25 03:12 mrgaolei

We provide scripts to perform quantization, but by default they quantize the expert weights and not the non-expert weights. In your case, it is only the non-expert weights that need quantizing. I think you can customize which weights get quantized by hacking the ignore_patterns in convert_gpu_weights.py.

Is that right? @ovowei

chenht2022 avatar Dec 02 '25 04:12 chenht2022
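
For concreteness, a hypothetical sketch of the kind of edit being suggested. The name ignore_patterns comes from the comment above; the default list and the pattern strings are illustrative guesses, not the actual contents of convert_gpu_weights.py:

```python
# Hypothetical: what flipping ignore_patterns in convert_gpu_weights.py
# might look like. The real file's defaults and pattern syntax may differ.

# Default-style behaviour: quantize expert weights, skip everything else.
ignore_patterns = [
    r".*self_attn.*",        # attention (non-expert) weights left in BF16
    r".*shared_experts.*",
    r".*embed_tokens.*",
    r".*lm_head.*",
]

# For this setup the routed experts already live on the CPU (AMXINT4),
# so invert the selection: skip the experts, quantize the rest.
ignore_patterns = [
    r".*mlp\.experts\..*",   # routed experts handled by the CPU weight path
]
```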

You just need one more 24 GB 3090.

I run this model on 2× 4090s without any problems.

I recommend having at least 48 GB of VRAM in order to get a decent context length.

shrould8888 avatar Dec 02 '25 06:12 shrould8888