
DeepSeek V3.2-Speciale CUDA OUT OF MEMORY

Open · shrould8888 opened this issue 3 weeks ago · 0 comments

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

Hello all! My config:

- CPU: Xeon 5
- RAM: 768 GB
- GPU: 2 × RTX 4090 24 GB (48 GB VRAM total)
- NVIDIA driver: 580.105.08 (patched for P2P)
- Linux kernel: 6.15.9
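A quick way to confirm that both cards are visible and how much VRAM is actually free before launching — a minimal sketch using only standard `torch.cuda` calls, nothing ktransformers-specific:

```python
# Minimal pre-launch sanity check: list visible GPUs and their free VRAM.
# Uses only standard PyTorch APIs (device_count, get_device_properties,
# mem_get_info); this is not part of the failing run above.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    free, total = torch.cuda.mem_get_info(i)  # returns (free, total) in bytes
    print(f"GPU {i}: {props.name}, "
          f"{free / 2**30:.2f} GiB free / {total / 2**30:.2f} GiB total")
```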

The command I ran is the one listed under Reproduction below.

I got the following error message. Could someone teach me how to fix it? Thanks in advance!

```
[2025-12-03 16:12:00] WARNING server_args.py:1073: DP attention is enabled for DeepSeek NSA.
[2025-12-03 16:12:00] WARNING server_args.py:1076: Setting page size to 64 for DeepSeek NSA.
[2025-12-03 16:12:00] WARNING server_args.py:1084: Setting KV cache dtype to bfloat16 for DeepSeek NSA.
NSA_DUAL_STREAM=True NSA_FUSE_TOPK=True NSA_FLASHMLA_BACKEND_DECODE_COMPUTE_FP8=True NSA_QUANT_K_CACHE_FAST=True NSA_DEQUANT_K_CACHE_FAST=True
[2025-12-03 16:12:00] WARNING server_args.py:1383: DP attention is enabled. The chunked prefill size is adjusted to 2048 to avoid MoE kernel issues.
[2025-12-03 16:12:00] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-12-03 16:12:01] Fail to set RLIMIT_NOFILE: current limit exceeds maximum limit
[2025-12-03 16:12:01] server_args=ServerArgs(model_path='/llm2/DeepSeek-V3.2-Speciale', tokenizer_path='/llm2/DeepSeek-V3.2-Speciale', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=10002, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='bfloat16', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.98, max_running_requests=32, max_queued_requests=None, max_total_tokens=40000, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=0.3, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=2, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=908527079, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', api_key=None, served_model_name='DeepSeek-V3.2', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=2, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='triton', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_amx_weight_path='/llm2/DeepSeek-V3.2-Speciale-CPU-INT8', kt_amx_method='AMXINT8', kt_cpuinfer=60, kt_threadpool_count=1, kt_num_gpu_experts=0, kt_max_deferred_experts_per_token=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=24, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=True, enable_dp_attention=True, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=True, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=True, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, decrypted_config_file=None, decrypted_draft_config_file=None)
[2025-12-03 16:12:01] Using default HuggingFace chat template with detected content format: string
[2025-12-03 16:12:05] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-12-03 16:12:05] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-12-03 16:12:09] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-12-03 16:12:09] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-12-03 16:12:09 DP1 TP1] Init torch distributed begin.
[2025-12-03 16:12:09 DP0 TP0] Init torch distributed begin.
[rank0]:[W1203 16:12:10.762675926 ProcessGroupGloo.cpp:514] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank1]:[W1203 16:12:10.762832238 ProcessGroupGloo.cpp:514] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-12-03 16:12:10 DP0 TP0] sglang is using nccl==2.27.7
[2025-12-03 16:12:10 DP0 TP0] reading GPU P2P access cache from /home/shroud/.cache/sglang/gpu_p2p_access_cache_for_0,1.json
[2025-12-03 16:12:10 DP1 TP1] reading GPU P2P access cache from /home/shroud/.cache/sglang/gpu_p2p_access_cache_for_0,1.json
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-03 16:12:11 DP0 TP0] Init torch distributed ends. mem usage=0.23 GB
[2025-12-03 16:12:11 DP1 TP1] Init torch distributed ends. mem usage=0.23 GB
[2025-12-03 16:12:12 DP1 TP1] Load weight begin. avail mem=22.88 GB
[2025-12-03 16:12:12 DP0 TP0] Load weight begin. avail mem=22.85 GB
[2025-12-03 16:12:12 DP0 TP0] Detected fp8 checkpoint.
[2025-12-03 16:12:12 DP0 TP0] Detected fp8 checkpoint.
[2025-12-03 16:12:12 DP1 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2802, in run_scheduler_process
    scheduler = Scheduler(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 311, in __init__
    self.tp_worker = TpModelWorker(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 237, in __init__
    self._model_runner = ModelRunner(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 322, in __init__
    self.initialize(min_per_gpu_memory)
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 398, in initialize
    self.load_model()
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 752, in load_model
    self.model = get_model(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/__init__.py", line 28, in get_model
    return loader.load_model(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 594, in load_model
    model = _initialize_model(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 262, in _initialize_model
    return model_class(**kwargs)
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 3031, in __init__
    self.model = DeepseekV2Model(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 2823, in __init__
    self.layers, self.start_layer, self.end_layer = make_layers(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/utils/common.py", line 576, in make_layers
    + get_offloader().wrap_modules(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/utils/offloader.py", line 36, in wrap_modules
    return list(all_modules_generator)
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/utils/common.py", line 578, in <genexpr>
    layer_fn(idx=idx, prefix=add_prefix(idx, prefix))
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 2825, in <lambda>
    lambda idx, prefix: DeepseekV2DecoderLayer(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 2613, in __init__
    self.mlp = DeepseekV2MoE(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 625, in __init__
    self.experts = get_moe_impl_class(quant_config)(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 214, in __init__
    self.quant_method.create_weights(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/fp8.py", line 603, in create_weights
    torch.empty(
  File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/torch/utils/_device.py", line 103, in __torch_function__
    return func(*args, **kwargs)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.50 GiB. GPU 1 has a total capacity of 23.52 GiB of which 3.37 GiB is free. Including non-PyTorch memory, this process has 20.13 GiB memory in use. Of the allocated memory 19.53 GiB is allocated by PyTorch, and 17.70 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

Reproduction

Run the following command:

```
python -m sglang.launch_server \
  --host 0.0.0.0 \
  --port 10002 \
  --model /llm2/DeepSeek-V3.2-Speciale \
  --kt-amx-weight-path /llm2/DeepSeek-V3.2-Speciale-CPU-INT8 \
  --kt-cpuinfer 60 \
  --kt-threadpool-count 1 \
  --kt-num-gpu-experts 0 \
  --attention-backend triton \
  --trust-remote-code \
  --mem-fraction-static 0.98 \
  --chunked-prefill-size 4096 \
  --max-running-requests 32 \
  --max-total-tokens 40000 \
  --served-model-name DeepSeek-V3.2 \
  --enable-mixed-chunk \
  --tensor-parallel-size 2 \
  --enable-p2p-check \
  --disable-shared-experts-fusion \
  --kt-amx-method AMXINT8
```
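A possible variation to try, combining the `PYTORCH_CUDA_ALLOC_CONF` hint printed at the end of the OOM message with a lower `--mem-fraction-static`. All flags are copied from the command above; the 0.90 value is an untested guess to create headroom, not a confirmed fix:

```python
# Untested relaunch sketch: same flags as above, plus the allocator hint
# suggested verbatim by the OOM message, and --mem-fraction-static lowered
# from 0.98 to 0.90 (an assumption, tune as needed).
import os
import subprocess
import sys

env = dict(os.environ, PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True")

cmd = [
    sys.executable, "-m", "sglang.launch_server",
    "--host", "0.0.0.0",
    "--port", "10002",
    "--model", "/llm2/DeepSeek-V3.2-Speciale",
    "--kt-amx-weight-path", "/llm2/DeepSeek-V3.2-Speciale-CPU-INT8",
    "--kt-cpuinfer", "60",
    "--kt-threadpool-count", "1",
    "--kt-num-gpu-experts", "0",
    "--attention-backend", "triton",
    "--trust-remote-code",
    "--mem-fraction-static", "0.90",  # lowered from 0.98
    "--chunked-prefill-size", "4096",
    "--max-running-requests", "32",
    "--max-total-tokens", "40000",
    "--served-model-name", "DeepSeek-V3.2",
    "--enable-mixed-chunk",
    "--tensor-parallel-size", "2",
    "--enable-p2p-check",
    "--disable-shared-experts-fusion",
    "--kt-amx-method", "AMXINT8",
]
subprocess.run(cmd, env=env, check=True)
```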

Others

No response
