DeepSeek-V3.2-Speciale: CUDA out of memory while loading weights
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
Hello all! My config:

- CPU: Xeon 5
- RAM: 768 GB
- GPUs: 2 × RTX 4090 24 GB (48 GB VRAM total)
- NVIDIA driver: 580.105.08 (patched for P2P)
- Kernel: Linux 6.15.9
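For reference, the VRAM and P2P numbers above can be double-checked with a short PyTorch snippet like the following (a minimal sanity-check sketch, plain torch only, nothing sglang-specific):

```python
# Minimal sketch (plain PyTorch, not sglang): confirm visible GPUs, free VRAM,
# and whether the p2p-patched driver actually exposes peer access.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 2**30:.2f} GiB free / {total / 2**30:.2f} GiB total")

if torch.cuda.device_count() >= 2:
    # Relevant to --enable-p2p-check: True means GPU 0 can access GPU 1 directly.
    print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
```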
The full launch command is listed under Reproduction below.
I got the following error message; could someone teach me how to fix it? Thanks in advance!
```
[2025-12-03 16:12:00] WARNING server_args.py:1073: DP attention is enabled for DeepSeek NSA.
[2025-12-03 16:12:00] WARNING server_args.py:1076: Setting page size to 64 for DeepSeek NSA.
[2025-12-03 16:12:00] WARNING server_args.py:1084: Setting KV cache dtype to bfloat16 for DeepSeek NSA.
NSA_DUAL_STREAM=True NSA_FUSE_TOPK=True NSA_FLASHMLA_BACKEND_DECODE_COMPUTE_FP8=True NSA_QUANT_K_CACHE_FAST=True NSA_DEQUANT_K_CACHE_FAST=True
[2025-12-03 16:12:00] WARNING server_args.py:1383: DP attention is enabled. The chunked prefill size is adjusted to 2048 to avoid MoE kernel issues.
[2025-12-03 16:12:00] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-12-03 16:12:01] Fail to set RLIMIT_NOFILE: current limit exceeds maximum limit
[2025-12-03 16:12:01] server_args=ServerArgs(model_path='/llm2/DeepSeek-V3.2-Speciale', tokenizer_path='/llm2/DeepSeek-V3.2-Speciale', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=True, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=10002, grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='bfloat16', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, mem_fraction_static=0.98, max_running_requests=32, max_queued_requests=None, max_total_tokens=40000, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=0.3, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', device='cuda', tp_size=2, pp_size=1, pp_max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=908527079, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', api_key=None, served_model_name='DeepSeek-V3.2', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=2, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='triton', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, nsa_prefill_backend='flashmla_sparse', nsa_decode_backend='fa3', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, 
speculative_token_map=None, speculative_attention_mode='prefill', speculative_moe_runner_backend=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype='float32', mamba_full_memory_ratio=0.9, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, kt_amx_weight_path='/llm2/DeepSeek-V3.2-Speciale-CPU-INT8', kt_amx_method='AMXINT8', kt_cpuinfer=60, kt_threadpool_count=1, kt_num_gpu_experts=0, kt_max_deferred_experts_per_token=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=24, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=True, enable_dp_attention=True, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=4096, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 2176, 2304, 2432, 2560, 2688, 2816, 2944, 3072, 3200, 3328, 3456, 3584, 3712, 3840, 3968, 4096], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=True, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, 
disable_shared_experts_fusion=True, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, decrypted_config_file=None, decrypted_draft_config_file=None)
[2025-12-03 16:12:01] Using default HuggingFace chat template with detected content format: string
[2025-12-03 16:12:05] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-12-03 16:12:05] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-12-03 16:12:09] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-12-03 16:12:09] INFO trace.py:52: opentelemetry package is not installed, tracing disabled
[2025-12-03 16:12:09 DP1 TP1] Init torch distributed begin.
[2025-12-03 16:12:09 DP0 TP0] Init torch distributed begin.
[rank0]:[W1203 16:12:10.762675926 ProcessGroupGloo.cpp:514] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[rank1]:[W1203 16:12:10.762832238 ProcessGroupGloo.cpp:514] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2025-12-03 16:12:10 DP0 TP0] sglang is using nccl==2.27.7
[2025-12-03 16:12:10 DP0 TP0] reading GPU P2P access cache from /home/shroud/.cache/sglang/gpu_p2p_access_cache_for_0,1.json
[2025-12-03 16:12:10 DP1 TP1] reading GPU P2P access cache from /home/shroud/.cache/sglang/gpu_p2p_access_cache_for_0,1.json
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2025-12-03 16:12:11 DP0 TP0] Init torch distributed ends. mem usage=0.23 GB
[2025-12-03 16:12:11 DP1 TP1] Init torch distributed ends. mem usage=0.23 GB
[2025-12-03 16:12:12 DP1 TP1] Load weight begin. avail mem=22.88 GB
[2025-12-03 16:12:12 DP0 TP0] Load weight begin. avail mem=22.85 GB
[2025-12-03 16:12:12 DP0 TP0] Detected fp8 checkpoint.
[2025-12-03 16:12:12 DP0 TP0] Detected fp8 checkpoint.
[2025-12-03 16:12:12 DP1 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2802, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 311, in init
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 237, in init
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 322, in init
self.initialize(min_per_gpu_memory)
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 398, in initialize
self.load_model()
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 752, in load_model
self.model = get_model(
^^^^^^^^^^
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/init.py", line 28, in get_model
return loader.load_model(
^^^^^^^^^^^^^^^^^^
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 594, in load_model
model = _initialize_model(
^^^^^^^^^^^^^^^^^^
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/model_loader/loader.py", line 262, in _initialize_model
return model_class(**kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 3031, in init
self.model = DeepseekV2Model(
^^^^^^^^^^^^^^^^
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/models/deepseek_v2.py", line 2823, in init
self.layers, self.start_layer, self.end_layer = make_layers(
^^^^^^^^^^^^
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/utils/common.py", line 576, in make_layers
+ get_offloader().wrap_modules(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/utils/offloader.py", line 36, in wrap_modules
return list(all_modules_generator)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/shroud/ktransformers-0.4.1/.venv/lib/python3.12/site-packages/sglang/srt/utils/common.py", line 578, in
Reproduction
Run the following command:

```bash
python -m sglang.launch_server \
--host 0.0.0.0 \
--port 10002 \
--model /llm2/DeepSeek-V3.2-Speciale \
--kt-amx-weight-path /llm2/DeepSeek-V3.2-Speciale-CPU-INT8 \
--kt-cpuinfer 60 \
--kt-threadpool-count 1 \
--kt-num-gpu-experts 0 \
--attention-backend triton \
--trust-remote-code \
--mem-fraction-static 0.98 \
--chunked-prefill-size 4096 \
--max-running-requests 32 \
--max-total-tokens 40000 \
--served-model-name DeepSeek-V3.2 \
--enable-mixed-chunk \
--tensor-parallel-size 2 \
--enable-p2p-check \
--disable-shared-experts-fusion \
--kt-amx-method AMXINT8
```
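For scale, here is a rough estimate of the KV-cache pool these flags request. It is a back-of-the-envelope sketch only: the MLA dimensions below are assumptions based on the DeepSeek-V3 family, not values read from this checkpoint.

```python
# Back-of-the-envelope KV-cache estimate. All architecture numbers below are
# ASSUMPTIONS taken from the DeepSeek-V3 family, not from this checkpoint.
kv_lora_rank = 512         # assumed MLA compressed-KV rank
qk_rope_head_dim = 64      # assumed RoPE dims stored alongside it
num_layers = 61            # assumed hidden-layer count
bytes_per_elem = 2         # bf16, per "Setting KV cache dtype to bfloat16" in the log
max_total_tokens = 40_000  # --max-total-tokens

per_token = (kv_lora_rank + qk_rope_head_dim) * num_layers * bytes_per_elem
print(f"{per_token / 1024:.1f} KiB/token, "
      f"{per_token * max_total_tokens / 2**30:.2f} GiB for the whole pool")
# ~68.6 KiB/token and ~2.6 GiB total -- small next to the fp8 weights, which is
# consistent with the crash happening during load_model rather than at runtime.
```

If that estimate is roughly right, the token pool is a minor cost here, and the pressure on `--mem-fraction-static 0.98` comes from the weights and buffers allocated during load.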
Others
No response