
Set chunked_prefill off when using MLA

Open · DragonFive opened this issue 10 months ago • 1 comment

FIX #13370 (link existing issues this PR will resolve)

FIX #13370. In vllm/config.py, chunked prefill and prefix caching are forced to be disabled for MLA, but that happens too late: when the user passes --enable-chunked-prefill for an MLA attention model, max_num_batched_tokens has already been set to its chunked-prefill default of 2048.
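For context, the ordering problem looks roughly like the sketch below. This is an illustration of the intended fix, not the actual diff; the helper `uses_mla` and the surrounding function are assumptions. The point is that the MLA check has to run before `max_num_batched_tokens` picks up the chunked-prefill default of 2048.

```python
# Sketch only, not the actual vLLM patch. Assumes a hypothetical helper
# uses_mla(model_config) and that this runs while engine arguments are
# resolved, i.e. before any scheduler defaults are applied.
def resolve_chunked_prefill(args, model_config):
    if uses_mla(model_config):
        # Force chunked prefill and prefix caching off early for MLA models,
        # so the branch below never applies the 2048-token default.
        args.enable_chunked_prefill = False
        args.enable_prefix_caching = False
    if args.enable_chunked_prefill and args.max_num_batched_tokens is None:
        # This is the default the issue is about: it is applied when the user
        # passes --enable-chunked-prefill, even though config.py later turns
        # chunked prefill back off for MLA.
        args.max_num_batched_tokens = 2048
```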

DragonFive · Feb 17 '25

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] · Feb 17 '25

Resolved by chunked prefill support

mgoin · Feb 25 '25

vLLM 0.7.1, torch 2.5.1. When using this version of vLLM with VLLM_TORCH_PROFILER_DIR=./traces/ set, the serve command is as follows:

```bash
VLLM_TORCH_PROFILER_DIR=./traces/ vllm serve /workspace/models/DeepSeek-V2-Lite-Chat \
    --gpu-memory-utilization 0.80 \
    --max-model-len 8000 \
    --max-num-batched-tokens 32000 \
    --max-num-seqs 1024 \
    --trust-remote-code \
    > deepseek-v2_triton$(date +%Y%m%d-%H%M).log &
```
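For comparison, the same engine configuration can be sketched through the offline Python API. This is a minimal reproduction sketch, not taken from the report above; `enable_chunked_prefill=True` is an assumption added only to exercise the code path this PR discusses.

```python
# Minimal reproduction sketch (not from the original report). The model path
# mirrors the serve command above; enable_chunked_prefill=True is an
# assumption added to exercise the MLA code path discussed in this PR.
from vllm import LLM

llm = LLM(
    model="/workspace/models/DeepSeek-V2-Lite-Chat",
    trust_remote_code=True,
    gpu_memory_utilization=0.80,
    max_model_len=8000,
    max_num_batched_tokens=32000,
    max_num_seqs=1024,
    enable_chunked_prefill=True,
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```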

On startup, the server log reports:

MLA is enabled; forcing chunked prefill and prefix caching to be disabled.

```
INFO 03-13 03:06:05 __init__.py:183] Automatically detected platform cuda.
WARNING 03-13 03:06:06 api_server.py:610] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
async_args_only: False parser: FlexibleArgumentParser(prog='vllm serve', usage='vllm serve <model_tag> [options]', description=None, formatter_class=<class 'vllm.utils.SortedHelpFormatter'>, conflict_handler='error', add_help=True)
INFO 03-13 03:06:06 api_server.py:838] vLLM API server version 0.7.1
INFO 03-13 03:06:06 api_server.py:839] args: Namespace(subparser='serve', model_tag='/workspace/models/DeepSeek-V2-Lite-Chat', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/workspace/models/DeepSeek-V2-Lite-Chat', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=32000, max_num_seqs=1024, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f968b7083a0>)
INFO 03-13 03:06:06 api_server.py:204] Started engine process with PID 372585
INFO 03-13 03:06:06 config.py:135] Replacing legacy 'type' key with 'rope_type'
INFO 03-13 03:06:11 __init__.py:183] Automatically detected platform cuda.
WARNING 03-13 03:06:12 api_server.py:610] Torch Profiler is enabled in the API server. This should ONLY be used for local development!
INFO 03-13 03:06:12 config.py:135] Replacing legacy 'type' key with 'rope_type'
INFO 03-13 03:06:13 config.py:526] This model supports multiple tasks: {'reward', 'score', 'embed', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 03-13 03:06:13 config.py:3257] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 03-13 03:06:18 config.py:526] This model supports multiple tasks: {'reward', 'embed', 'generate', 'score', 'classify'}. Defaulting to 'generate'.
INFO 03-13 03:06:18 config.py:3257] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 03-13 03:06:18 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='/workspace/models/DeepSeek-V2-Lite-Chat', speculative_config=None, tokenizer='/workspace/models/DeepSeek-V2-Lite-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/workspace/models/DeepSeek-V2-Lite-Chat, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[1024,1016,1008,1000,992,984,976,968,960,952,944,936,928,920,912,904,896,888,880,872,864,856,848,840,832,824,816,808,800,792,784,776,768,760,752,744,736,728,720,712,704,696,688,680,672,664,656,648,640,632,624,616,608,600,592,584,576,568,560,552,544,536,528,520,512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":1024}, use_cached_outputs=True,
INFO 03-13 03:06:19 cuda.py:166] Using Triton MLA backend.
WARNING 03-13 03:06:20 triton_decode_attention.py:42] The following error message 'operation scheduled before its operands' can be ignored.
INFO 03-13 03:06:20 worker.py:101] Profiling enabled. Traces will be saved to: ./traces/
INFO 03-13 03:06:20 model_runner.py:1111] Starting to load model /workspace/models/DeepSeek-V2-Lite-Chat...
INFO 03-13 03:06:20 cuda.py:166] Using Triton MLA backend.
INFO 03-13 03:06:35 model_runner.py:1116] Loading model weights took 31.1253 GB
WARNING 03-13 03:06:35 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /workspace/miniconda/envs/llama/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1408,device_name=NVIDIA_H100_80GB_HBM3.json
INFO 03-13 03:06:36 worker.py:266] Memory profiling takes 1.51 seconds
INFO 03-13 03:06:36 worker.py:266] the current vLLM instance can use total_gpu_memory (79.32GiB) x gpu_memory_utilization (0.80) = 63.46GiB
INFO 03-13 03:06:36 worker.py:266] model weights take 31.13GiB; non_torch_memory takes 0.15GiB; PyTorch activation peak memory takes 3.83GiB; the rest of the memory reserved for KV Cache is 28.34GiB.
INFO 03-13 03:06:36 executor_base.py:108] # CUDA blocks: 61150, # CPU blocks: 8630
```
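The memory budget in the log above can be sanity-checked with simple arithmetic. This is a sketch that just reproduces the logged numbers, not vLLM's internal accounting:

```python
# Reproduce the KV-cache budget reported by worker.py above (values in GiB).
total_gpu_memory = 79.32
gpu_memory_utilization = 0.80
model_weights = 31.13
non_torch_memory = 0.15
activation_peak = 3.83

usable = total_gpu_memory * gpu_memory_utilization  # ~63.46 GiB
kv_cache = usable - model_weights - non_torch_memory - activation_peak
# ~28.3 GiB, matching the logged 28.34GiB up to rounding of the inputs.
print(f"usable={usable:.2f} GiB, kv_cache={kv_cache:.2f} GiB")
```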

dshwei · Mar 13 '25