
[Bug]: Enabling fp8 KV cache quantization and prefix caching at the same time on Radeon (W7900/RDNA3) crashes the process

Open · hjc4869 opened this issue 1 year ago · 1 comment

Your current environment

The output of `python collect_env.py`
INFO 02-12 09:27:59 __init__.py:190] Automatically detected platform rocm.
WARNING 02-12 09:27:59 rocm.py:33] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead.
Collecting environment information...
PyTorch version: 2.6.0.dev20241230+rocm6.3
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.3.42131-fa1d09cbd

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.11.11-gddbf17641154-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Radeon PRO W7900 Dual Slot  (gfx1100)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.3.42131
MIOpen runtime version: 3.3.0
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               64
On-line CPU(s) list:                  0-63
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen Threadripper 7970X 32-Cores
CPU family:                           25
Model:                                24
Thread(s) per core:                   2
Core(s) per socket:                   32
Socket(s):                            1
Stepping:                             1
Frequency boost:                      enabled
CPU max MHz:                          5352.0000
CPU min MHz:                          545.0000
BogoMIPS:                             7987.23
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
Virtualization:                       AMD-V
L1d cache:                            1 MiB (32 instances)
L1i cache:                            1 MiB (32 instances)
L2 cache:                             32 MiB (32 instances)
L3 cache:                             128 MiB (4 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-63
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] lion-pytorch==0.2.3
[pip3] numpy==1.26.4
[pip3] pynvml==11.5.3
[pip3] pytorch-triton-rocm==3.2.0+git0d4682f0
[pip3] pyzmq==26.2.0
[pip3] torch==2.6.0.dev20241230+rocm6.3
[pip3] torchaudio==2.6.0.dev20241230+rocm6.2.4
[pip3] torchvision==0.22.0.dev20241230+rocm6.2.4
[pip3] transformers==4.48.3
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: 6.3.42134-a9a80e791
Neuron SDK Version: N/A
vLLM Version: 0.7.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         
GPU0   0            40           40           
GPU1   40           0            40           
GPU2   40           40           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         
GPU0   0            2            2            
GPU1   2            0            2            
GPU2   2            2            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         
GPU0   0            PCIE         PCIE         
GPU1   PCIE         0            PCIE         
GPU2   PCIE         PCIE         0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 0
GPU[0]          : (Topology) Numa Affinity: -1
GPU[1]          : (Topology) Numa Node: 0
GPU[1]          : (Topology) Numa Affinity: -1
GPU[2]          : (Topology) Numa Node: 0
GPU[2]          : (Topology) Numa Affinity: -1
================================== End of ROCm SMI Log ===================================

NCCL_DMABUF_ENABLE=1
VLLM_WORKER_MULTIPROC_METHOD=spawn
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

πŸ› Describe the bug

Server command line:

python -O -u -m vllm.entrypoints.openai.api_server --host=0.0.0.0 --port=8000 \
    --model ~/models/Qwen2.5-72B-Instruct-GPTQ-Int8/ \
    --served-model-name Qwen\ 2.5\ 72B \
    --enable-prefix-caching \
    --tensor-parallel-size 2 \
    --kv-cache-dtype fp8

Client command line (run twice):

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen 2.5 72B", "messages": [{"role": "user", "content": "Test"}]}'

Error message: `ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e5',)")`

Removing either `--enable-prefix-caching` or `--kv-cache-dtype fp8` avoids the crash.
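For reference, a minimal offline sketch of the two configurations that do not crash. This mirrors the CLI flags above through the `LLM` constructor (which forwards them to the engine args); the local model path is the one from the server log, and the script is only illustrative, not part of the original report:

```python
from vllm import LLM, SamplingParams

MODEL = "/home/user/models/Qwen2.5-72B-Instruct-GPTQ-Int8/"

# Works on W7900/RDNA3: fp8 KV cache with prefix caching left disabled.
llm = LLM(model=MODEL, tensor_parallel_size=2, kv_cache_dtype="fp8")

# Alternatively (in a separate process): prefix caching with the default
# ("auto") KV cache dtype also works; only the combination of the two crashes.
# llm = LLM(model=MODEL, tensor_parallel_size=2, enable_prefix_caching=True)

print(llm.generate(["Test"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```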

The full output of vLLM server ```text INFO 02-12 09:30:38 __init__.py:190] Automatically detected platform rocm. WARNING 02-12 09:30:38 rocm.py:33] `fork` method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to `spawn` instead. INFO 02-12 09:30:39 api_server.py:840] vLLM API server version 0.7.2 INFO 02-12 09:30:39 api_server.py:841] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/home/user/models/Qwen2.5-72B-Instruct-GPTQ-Int8/', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=, dtype='auto', kv_cache_dtype='fp8', max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen 2.5 72B'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', 
generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False) INFO 02-12 09:30:39 api_server.py:206] Started engine process with PID 111326 INFO 02-12 09:30:41 __init__.py:190] Automatically detected platform rocm. INFO 02-12 09:30:47 config.py:542] This model supports multiple tasks: {'embed', 'generate', 'reward', 'classify', 'score'}. Defaulting to 'generate'. INFO 02-12 09:30:50 config.py:542] This model supports multiple tasks: {'score', 'generate', 'reward', 'embed', 'classify'}. Defaulting to 'generate'. WARNING 02-12 09:30:50 config.py:621] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models. INFO 02-12 09:30:50 config.py:1115] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor INFO 02-12 09:30:50 config.py:1401] Defaulting to use mp for distributed inference INFO 02-12 09:30:50 config.py:1431] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. WARNING 02-12 09:30:54 config.py:621] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models. INFO 02-12 09:30:54 config.py:1115] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor INFO 02-12 09:30:54 config.py:1401] Defaulting to use mp for distributed inference INFO 02-12 09:30:54 config.py:1431] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 02-12 09:30:54 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='/home/user/models/Qwen2.5-72B-Instruct-GPTQ-Int8/', speculative_config=None, tokenizer='/home/user/models/Qwen2.5-72B-Instruct-GPTQ-Int8/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=fp8, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen 2.5 72B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, WARNING 02-12 09:30:54 multiproc_worker_utils.py:300] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. INFO 02-12 09:30:54 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager INFO 02-12 09:30:54 rocm.py:92] None is not supported in AMD GPUs. 
INFO 02-12 09:30:54 rocm.py:93] Using ROCmFlashAttention backend. INFO 02-12 09:30:56 __init__.py:190] Automatically detected platform rocm. (VllmWorkerProcess pid=111800) INFO 02-12 09:31:01 multiproc_worker_utils.py:229] Worker ready; awaiting tasks (VllmWorkerProcess pid=111800) INFO 02-12 09:31:01 rocm.py:92] None is not supported in AMD GPUs. (VllmWorkerProcess pid=111800) INFO 02-12 09:31:01 rocm.py:93] Using ROCmFlashAttention backend. INFO 02-12 09:31:02 utils.py:950] Found nccl from library librccl.so.1 INFO 02-12 09:31:02 pynccl.py:69] vLLM is using nccl==2.21.5 (VllmWorkerProcess pid=111800) INFO 02-12 09:31:02 utils.py:950] Found nccl from library librccl.so.1 (VllmWorkerProcess pid=111800) INFO 02-12 09:31:02 pynccl.py:69] vLLM is using nccl==2.21.5 INFO 02-12 09:31:02 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_c550d8e8'), local_subscribe_port=49379, remote_subscribe_port=None) INFO 02-12 09:31:02 model_runner.py:1110] Starting to load model /home/user/models/Qwen2.5-72B-Instruct-GPTQ-Int8/... (VllmWorkerProcess pid=111800) INFO 02-12 09:31:02 model_runner.py:1110] Starting to load model /home/user/models/Qwen2.5-72B-Instruct-GPTQ-Int8/... WARNING 02-12 09:31:02 rocm.py:150] Model architecture 'Qwen2ForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0` (VllmWorkerProcess pid=111800) WARNING 02-12 09:31:02 rocm.py:150] Model architecture 'Qwen2ForCausalLM' is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting `VLLM_USE_TRITON_FLASH_ATTN=0` Loading safetensors checkpoint shards: 0% Completed | 0/20 [00:00, ?it/s] Loading safetensors checkpoint shards: 5% Completed | 1/20 [00:01(VllmWorkerProcess pid=111800) INFO 02-12 09:31:28 model_runner.py:1115] Loading model weights took 35.9147 GB INFO 02-12 09:31:28 model_runner.py:1115] Loading model weights took 35.9147 GB (VllmWorkerProcess pid=111800) INFO 02-12 09:32:33 worker.py:267] Memory profiling takes 65.50 seconds (VllmWorkerProcess pid=111800) INFO 02-12 09:32:33 worker.py:267] the current vLLM instance can use total_gpu_memory (47.98GiB) x gpu_memory_utilization (0.95) = 45.59GiB (VllmWorkerProcess pid=111800) INFO 02-12 09:32:33 worker.py:267] model weights take 35.91GiB; non_torch_memory takes 0.39GiB; PyTorch activation peak memory takes 5.25GiB; the rest of the memory reserved for KV Cache is 4.03GiB. INFO 02-12 09:32:33 worker.py:267] Memory profiling takes 65.56 seconds INFO 02-12 09:32:33 worker.py:267] the current vLLM instance can use total_gpu_memory (47.98GiB) x gpu_memory_utilization (0.95) = 45.59GiB INFO 02-12 09:32:33 worker.py:267] model weights take 35.91GiB; non_torch_memory takes 0.42GiB; PyTorch activation peak memory takes 5.25GiB; the rest of the memory reserved for KV Cache is 4.00GiB. INFO 02-12 09:32:34 executor_base.py:110] # CUDA blocks: 3277, # CPU blocks: 3276 INFO 02-12 09:32:34 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 1.60x INFO 02-12 09:32:40 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. 
If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage. Capturing CUDA graph shapes: 0%| | 0/35 [00:00<?, ?it/s](VllmWorkerProcess pid=111800) INFO 02-12 09:32:40 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage. Capturing CUDA graph shapes: 97%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 34/35 [00:45<00:00, 1.23it/s](VllmWorkerProcess pid=111800) INFO 02-12 09:33:27 model_runner.py:1562] Graph capturing finished in 46 secs, took 0.66 GiB Capturing CUDA graph shapes: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 35/35 [00:46<00:00, 1.33s/it] INFO 02-12 09:33:27 model_runner.py:1562] Graph capturing finished in 47 secs, took 0.66 GiB INFO 02-12 09:33:27 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 119.26 seconds INFO 02-12 09:33:27 api_server.py:756] Using supplied chat template: INFO 02-12 09:33:27 api_server.py:756] None INFO 02-12 09:33:27 launcher.py:21] Available routes are: INFO 02-12 09:33:27 launcher.py:29] Route: /openapi.json, Methods: GET, HEAD INFO 02-12 09:33:27 launcher.py:29] Route: /docs, Methods: GET, HEAD INFO 02-12 09:33:27 launcher.py:29] Route: /docs/oauth2-redirect, Methods: GET, HEAD INFO 02-12 09:33:27 launcher.py:29] Route: /redoc, Methods: GET, HEAD INFO 02-12 09:33:27 launcher.py:29] Route: /health, Methods: GET INFO 02-12 09:33:27 launcher.py:29] Route: /ping, Methods: GET, POST INFO 02-12 09:33:27 launcher.py:29] Route: /tokenize, Methods: POST INFO 02-12 09:33:27 launcher.py:29] Route: /detokenize, Methods: POST INFO 02-12 09:33:27 launcher.py:29] Route: /v1/models, Methods: GET INFO 02-12 09:33:27 launcher.py:29] Route: /version, Methods: GET INFO 02-12 09:33:27 launcher.py:29] Route: /v1/chat/completions, Methods: POST INFO 02-12 09:33:27 launcher.py:29] Route: /v1/completions, Methods: POST INFO 02-12 09:33:27 launcher.py:29] Route: /v1/embeddings, Methods: POST INFO 02-12 09:33:27 launcher.py:29] Route: /pooling, Methods: POST INFO 02-12 09:33:27 launcher.py:29] Route: /score, Methods: POST INFO 02-12 09:33:27 launcher.py:29] Route: /v1/score, Methods: POST INFO 02-12 09:33:27 launcher.py:29] Route: /rerank, Methods: POST INFO 02-12 09:33:27 launcher.py:29] Route: /v1/rerank, Methods: POST INFO 02-12 09:33:27 launcher.py:29] Route: /v2/rerank, Methods: POST INFO 02-12 09:33:27 launcher.py:29] Route: /invocations, Methods: POST INFO: Started server process [111255] INFO: Waiting for application startup. 
INFO: Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) INFO: 10.24.0.80:41608 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request INFO 02-12 09:34:15 chat_utils.py:332] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this. INFO 02-12 09:34:15 logger.py:39] Received request chatcmpl-a3efd573dda64ccbb0ffe507917bb547: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nTest<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32738, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None. INFO 02-12 09:34:15 engine.py:275] Added request chatcmpl-a3efd573dda64ccbb0ffe507917bb547. INFO 02-12 09:34:21 metrics.py:455] Avg prompt throughput: 2.1 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%. INFO 02-12 09:34:21 metrics.py:471] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00% INFO: 10.24.0.80:50482 - "POST /v1/chat/completions HTTP/1.1" 200 OK INFO 02-12 09:34:33 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%. INFO 02-12 09:34:33 metrics.py:471] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00% INFO 02-12 09:34:43 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%. INFO 02-12 09:34:43 metrics.py:471] Prefix cache hit rate: GPU: 0.00%, CPU: 0.00% INFO 02-12 09:35:06 logger.py:39] Received request chatcmpl-ab28f556c0a440d397a5c3b16ff3c94e: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\nTest<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32738, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None. INFO 02-12 09:35:06 engine.py:275] Added request chatcmpl-ab28f556c0a440d397a5c3b16ff3c94e. CRITICAL 02-12 09:35:06 launcher.py:101] MQLLMEngine is already dead, terminating server process INFO: 10.24.0.80:55184 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop. 
(VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] Traceback (most recent call last): (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/executor/multiproc_worker_utils.py", line 236, in _run_worker_process (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] output = run_method(worker, method, args, kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/utils.py", line 2220, in run_method (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return func(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/worker/worker_base.py", line 93, in start_worker_execution_loop (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] output = self.execute_model(execute_model_req=None) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/worker/worker_base.py", line 413, in execute_model (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] output = self.model_runner.execute_model( (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return func(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/worker/model_runner.py", line 1719, in execute_model (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] hidden_or_intermediate_states = model_executable( (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return self._call_impl(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return forward_call(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/model_executor/models/qwen2.py", line 486, in forward (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] hidden_states = self.model(input_ids, positions, kv_caches, (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/compilation/decorators.py", line 172, in call (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return self.forward(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/model_executor/models/qwen2.py", line 348, in forward (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 
multiproc_worker_utils.py:242] hidden_states, residual = layer( (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return self._call_impl(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return forward_call(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/model_executor/models/qwen2.py", line 247, in forward (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] hidden_states = self.self_attn( (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return self._call_impl(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return forward_call(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/model_executor/models/qwen2.py", line 179, in forward (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] attn_output = self.attn(q, k, v, kv_cache, attn_metadata) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return self._call_impl(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return forward_call(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/attention/layer.py", line 201, in forward (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return torch.ops.vllm.unified_attention( (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/_ops.py", line 1122, in call (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return self._op(*args, **(kwargs or {})) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/attention/layer.py", line 307, in unified_attention (VllmWorkerProcess pid=111800) ERROR 
02-12 09:35:06 multiproc_worker_utils.py:242] return self.impl.forward(self, query, key, value, kv_cache, attn_metadata) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/attention/backends/rocm_flash_attn.py", line 748, in forward (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] output[:num_prefill_tokens] = PagedAttention.forward_prefix( (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/attention/ops/paged_attn.py", line 213, in forward_prefix (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] context_attention_fwd( (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return func(*args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/vllm/attention/ops/prefix_prefill.py", line 827, in context_attention_fwd (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] _fwd_kernel[grid]( (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/triton/runtime/jit.py", line 657, in run (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] kernel = self.compile( (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/triton/compiler/compiler.py", line 283, in compile (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] module = src.make_ir(options, codegen_fns, module_map, context) (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/triton/compiler/compiler.py", line 113, in make_ir (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] triton.compiler.errors.CompilationError: at 1:0: (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] def _fwd_kernel( (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] ^ (VllmWorkerProcess pid=111800) ERROR 02-12 09:35:06 multiproc_worker_utils.py:242] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e5',)") ERROR 02-12 09:35:06 engine.py:139] CompilationError('def _fwd_kernel(\n Q,\n K,\n V,\n K_cache,\n V_cache,\n B_Loc,\n sm_scale,\n k_scale,\n v_scale,\n B_Start_Loc,\n B_Seqlen,\n B_Ctxlen,\n block_size,\n x,\n Out,\n stride_b_loc_b,\n stride_b_loc_s,\n stride_qbs,\n stride_qh,\n stride_qd,\n stride_kbs,\n stride_kh,\n stride_kd,\n stride_vbs,\n stride_vh,\n stride_vd,\n stride_obs,\n stride_oh,\n stride_od,\n stride_k_cache_bs,\n stride_k_cache_h,\n stride_k_cache_d,\n stride_k_cache_bl,\n stride_k_cache_x,\n stride_v_cache_bs,\n stride_v_cache_h,\n stride_v_cache_d,\n stride_v_cache_bl,\n num_queries_per_kv: int,\n IN_PRECISION: tl.constexpr,\n BLOCK_M: tl.constexpr,\n BLOCK_DMODEL: tl.constexpr, # head size\n BLOCK_DMODEL_PADDED: tl.constexpr, # head size padded to a power of 2\n BLOCK_N: tl.constexpr,\n SLIDING_WINDOW: tl.constexpr,\n):\n cur_batch = tl.program_id(0)\n cur_head = tl.program_id(1)\n start_m = tl.program_id(2)\n\n cur_kv_head = cur_head // num_queries_per_kv\n\n cur_batch_ctx_len = tl.load(B_Ctxlen + cur_batch)\n cur_batch_seq_len = tl.load(B_Seqlen + cur_batch)\n cur_batch_in_all_start_index = tl.load(B_Start_Loc + cur_batch)\n cur_batch_query_len = cur_batch_seq_len - cur_batch_ctx_len\n\n # start position inside of the query\n # generally, N goes over kv, while M goes over query_len\n block_start_loc = BLOCK_M * start_m\n\n # initialize offsets\n # [N]; starts at 0\n offs_n = tl.arange(0, BLOCK_N)\n # [D]; starts at 0\n offs_d = tl.arange(0, BLOCK_DMODEL_PADDED)\n # [M]; starts at current position in query\n offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)\n # [M,D]\n off_q = (\n (cur_batch_in_all_start_index + offs_m[:, None]) * stride_qbs +\n cur_head * stride_qh + offs_d[None, :] * stride_qd)\n\n dim_mask = tl.where(\n tl.arange(0, BLOCK_DMODEL_PADDED) < BLOCK_DMODEL, 1,\n 0).to(tl.int1) # [D]\n\n q = tl.load(Q + off_q,\n mask=dim_mask[None, :] &\n (offs_m[:, None] < cur_batch_query_len),\n other=0.0) # [M,D]\n\n # initialize pointer to m and l\n m_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float("inf") # [M]\n l_i = tl.zeros([BLOCK_M], dtype=tl.float32) # [M]\n acc = tl.zeros([BLOCK_M, BLOCK_DMODEL_PADDED],\n dtype=tl.float32) # [M,D]\n\n # compute query against context (no causal mask here)\n for start_n in range(0, cur_batch_ctx_len, BLOCK_N):\n start_n = tl.multiple_of(start_n, BLOCK_N)\n # -- compute qk ----\n bn = tl.load(B_Loc + cur_batch * stride_b_loc_b +\n ((start_n + offs_n) // block_size) * stride_b_loc_s,\n mask=(start_n + offs_n) < cur_batch_ctx_len,\n other=0) # [N]\n # [D,N]\n off_k = (bn[None, :] * stride_k_cache_bs +\n cur_kv_head * stride_k_cache_h +\n (offs_d[:, None] // x) * stride_k_cache_d +\n ((start_n + offs_n[None, :]) % block_size) *\n stride_k_cache_bl +\n (offs_d[:, None] % x) * stride_k_cache_x)\n # [N,D]\n off_v = (\n bn[:, None] * stride_v_cache_bs +\n cur_kv_head * stride_v_cache_h +\n offs_d[None, :] * stride_v_cache_d +\n (start_n + offs_n[:, None]) % block_size * stride_v_cache_bl)\n k_load = tl.load(K_cache + off_k,\n mask=dim_mask[:, None] &\n ((start_n + offs_n[None, :]) < cur_batch_ctx_len),\n other=0.0) # [D,N]\n\n if k_load.dtype.is_fp8():\n k = (k_load.to(tl.float32) * tl.load(k_scale)).to(q.dtype)\n else:\n k = k_load\n\n qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32) # [M,N]\n qk = tl.dot(q, k, acc=qk, input_precision=IN_PRECISION)\n qk = tl.where((start_n + offs_n[None, :]) < cur_batch_ctx_len, qk,\n float("-inf"))\n qk *= sm_scale\n if SLIDING_WINDOW > 0:\n # 
(cur_batch_ctx_len + offs_m[:, None]) are the positions of\n # Q entries in sequence\n # (start_n + offs_n[None, :]) are the positions of\n # KV entries in sequence\n # So the condition makes sure each entry in Q only attends\n # to KV entries not more than SLIDING_WINDOW away.\n #\n # We can't use -inf here, because the\n # sliding window may lead to the entire row being masked.\n # This then makes m_ij contain -inf, which causes NaNs in\n # exp().\n qk = tl.where((cur_batch_ctx_len + offs_m[:, None]) -\n (start_n + offs_n[None, :]) < SLIDING_WINDOW, qk,\n -10000)\n\n # -- compute m_ij, p, l_ij\n m_ij = tl.max(qk, 1) # [M]\n p = tl.exp(qk - m_ij[:, None]) # [M,N]\n l_ij = tl.sum(p, 1) # [M]\n # -- update m_i and l_i\n m_i_new = tl.maximum(m_i, m_ij) # [M]\n alpha = tl.exp(m_i - m_i_new) # [M]\n beta = tl.exp(m_ij - m_i_new) # [M]\n l_i_new = alpha * l_i + beta * l_ij # [M]\n\n # -- update output accumulator --\n # scale p\n p_scale = beta / l_i_new\n p = p * p_scale[:, None]\n # scale acc\n acc_scale = l_i / l_i_new * alpha\n acc = acc * acc_scale[:, None]\n # update acc\n v_load = tl.load(V_cache + off_v,\n mask=dim_mask[None, :] &\n ((start_n + offs_n[:, None]) < cur_batch_ctx_len),\n other=0.0) # [N,D]\n if v_load.dtype.is_fp8():\n v = (v_load.to(tl.float32) * tl.load(v_scale)).to(q.dtype)\n else:\n v = v_load\n p = p.to(v.dtype)\n\n acc = tl.dot(p, v, acc=acc, input_precision=IN_PRECISION)\n # # update m_i and l_i\n l_i = l_i_new\n m_i = m_i_new\n\n off_k = (offs_n[None, :] * stride_kbs + cur_kv_head * stride_kh +\n offs_d[:, None] * stride_kd)\n off_v = (offs_n[:, None] * stride_vbs + cur_kv_head * stride_vh +\n offs_d[None, :] * stride_vd)\n k_ptrs = K + off_k\n v_ptrs = V + off_v\n\n # block_mask is 0 when we're already past the current query length\n block_mask = tl.where(block_start_loc < cur_batch_query_len, 1, 0)\n\n # compute query against itself (with causal mask)\n for start_n in range(0, block_mask * (start_m + 1) * BLOCK_M, BLOCK_N):\n start_n = tl.multiple_of(start_n, BLOCK_N)\n # -- compute qk ----\n k = tl.load(k_ptrs +\n (cur_batch_in_all_start_index + start_n) * stride_kbs,\n mask=dim_mask[:, None] &\n ((start_n + offs_n[None, :]) < cur_batch_query_len),\n other=0.0)\n\n qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)\n qk = tl.dot(q, k, acc=qk, input_precision=IN_PRECISION)\n qk *= sm_scale\n # apply causal mask\n qk = tl.where(offs_m[:, None] >= (start_n + offs_n[None, :]), qk,\n float("-inf"))\n if SLIDING_WINDOW > 0:\n qk = tl.where(\n offs_m[:, None] - (start_n + offs_n[None, :])\n < SLIDING_WINDOW, qk, -10000)\n\n # -- compute m_ij, p, l_ij\n m_ij = tl.max(qk, 1)\n p = tl.exp(qk - m_ij[:, None])\n l_ij = tl.sum(p, 1)\n # -- update m_i and l_i\n m_i_new = tl.maximum(m_i, m_ij)\n alpha = tl.exp(m_i - m_i_new)\n beta = tl.exp(m_ij - m_i_new)\n l_i_new = alpha * l_i + beta * l_ij\n # -- update output accumulator --\n # scale p\n p_scale = beta / l_i_new\n p = p * p_scale[:, None]\n # scale acc\n acc_scale = l_i / l_i_new * alpha\n acc = acc * acc_scale[:, None]\n # update acc\n v = tl.load(v_ptrs +\n (cur_batch_in_all_start_index + start_n) * stride_vbs,\n mask=dim_mask[None, :] &\n ((start_n + offs_n[:, None]) < cur_batch_query_len),\n other=0.0)\n p = p.to(v.dtype)\n\n acc = tl.dot(p, v, acc=acc, input_precision=IN_PRECISION)\n # update m_i and l_i\n l_i = l_i_new\n m_i = m_i_new\n # initialize pointers to output\n off_o = (\n (cur_batch_in_all_start_index + offs_m[:, None]) * stride_obs +\n cur_head * stride_oh + offs_d[None, :] * stride_od)\n out_ptrs = Out 
+ off_o\n tl.store(out_ptrs,\n acc,\n mask=dim_mask[None, :] &\n (offs_m[:, None] < cur_batch_query_len))\n return\n', <ast.FunctionDef object at 0x7f2a6fe76bc0>, 'ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e5',)")') ERROR 02-12 09:35:06 engine.py:139] Traceback (most recent call last): ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/engine/multiprocessing/engine.py", line 137, in start ERROR 02-12 09:35:06 engine.py:139] self.run_engine_loop() ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/engine/multiprocessing/engine.py", line 200, in run_engine_loop ERROR 02-12 09:35:06 engine.py:139] request_outputs = self.engine_step() ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/engine/multiprocessing/engine.py", line 218, in engine_step ERROR 02-12 09:35:06 engine.py:139] raise e ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/engine/multiprocessing/engine.py", line 209, in engine_step ERROR 02-12 09:35:06 engine.py:139] return self.engine.step() ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/engine/llm_engine.py", line 1386, in step ERROR 02-12 09:35:06 engine.py:139] outputs = self.model_executor.execute_model( ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/executor/executor_base.py", line 275, in execute_model ERROR 02-12 09:35:06 engine.py:139] driver_outputs = self._driver_execute_model(execute_model_req) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/executor/mp_distributed_executor.py", line 144, in _driver_execute_model ERROR 02-12 09:35:06 engine.py:139] return self.driver_worker.execute_model(execute_model_req) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/worker/worker_base.py", line 413, in execute_model ERROR 02-12 09:35:06 engine.py:139] output = self.model_runner.execute_model( ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context ERROR 02-12 09:35:06 engine.py:139] return func(*args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/worker/model_runner.py", line 1719, in execute_model ERROR 02-12 09:35:06 engine.py:139] hidden_or_intermediate_states = model_executable( ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl ERROR 02-12 09:35:06 engine.py:139] return self._call_impl(*args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl ERROR 02-12 09:35:06 engine.py:139] return forward_call(*args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/model_executor/models/qwen2.py", line 486, in forward ERROR 02-12 09:35:06 engine.py:139] hidden_states = self.model(input_ids, positions, kv_caches, ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/compilation/decorators.py", line 172, in call ERROR 02-12 09:35:06 engine.py:139] return self.forward(*args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/model_executor/models/qwen2.py", 
line 348, in forward ERROR 02-12 09:35:06 engine.py:139] hidden_states, residual = layer( ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl ERROR 02-12 09:35:06 engine.py:139] return self._call_impl(*args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl ERROR 02-12 09:35:06 engine.py:139] return forward_call(*args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/model_executor/models/qwen2.py", line 247, in forward ERROR 02-12 09:35:06 engine.py:139] hidden_states = self.self_attn( ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl ERROR 02-12 09:35:06 engine.py:139] return self._call_impl(*args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl ERROR 02-12 09:35:06 engine.py:139] return forward_call(*args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/model_executor/models/qwen2.py", line 179, in forward ERROR 02-12 09:35:06 engine.py:139] attn_output = self.attn(q, k, v, kv_cache, attn_metadata) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl ERROR 02-12 09:35:06 engine.py:139] return self._call_impl(*args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl ERROR 02-12 09:35:06 engine.py:139] return forward_call(*args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/attention/layer.py", line 201, in forward ERROR 02-12 09:35:06 engine.py:139] return torch.ops.vllm.unified_attention( ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/_ops.py", line 1122, in call ERROR 02-12 09:35:06 engine.py:139] return self._op(*args, **(kwargs or {})) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/attention/layer.py", line 307, in unified_attention ERROR 02-12 09:35:06 engine.py:139] return self.impl.forward(self, query, key, value, kv_cache, attn_metadata) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/attention/backends/rocm_flash_attn.py", line 748, in forward ERROR 02-12 09:35:06 engine.py:139] output[:num_prefill_tokens] = PagedAttention.forward_prefix( ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/attention/ops/paged_attn.py", line 213, in forward_prefix ERROR 02-12 09:35:06 engine.py:139] context_attention_fwd( ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context ERROR 02-12 09:35:06 engine.py:139] return func(*args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/vllm/attention/ops/prefix_prefill.py", line 827, in context_attention_fwd ERROR 02-12 09:35:06 engine.py:139] _fwd_kernel[grid]( ERROR 
02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/triton/runtime/jit.py", line 330, in ERROR 02-12 09:35:06 engine.py:139] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/triton/runtime/jit.py", line 657, in run ERROR 02-12 09:35:06 engine.py:139] kernel = self.compile( ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/triton/compiler/compiler.py", line 283, in compile ERROR 02-12 09:35:06 engine.py:139] module = src.make_ir(options, codegen_fns, module_map, context) ERROR 02-12 09:35:06 engine.py:139] File "/mnt/storage/david/Development/vllm/venv/lib/python3.10/site-packages/triton/compiler/compiler.py", line 113, in make_ir ERROR 02-12 09:35:06 engine.py:139] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, ERROR 02-12 09:35:06 engine.py:139] triton.compiler.errors.CompilationError: at 1:0: ERROR 02-12 09:35:06 engine.py:139] def _fwd_kernel( ERROR 02-12 09:35:06 engine.py:139] ^ ERROR 02-12 09:35:06 engine.py:139] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e5',)") INFO: Shutting down INFO: Waiting for application shutdown. INFO: Application shutdown complete. INFO: Finished server process [111255] Process SpawnProcess-1: Traceback (most recent call last): File "/usr/lib/python3.10/multiprocessing/process.py", line 317, in _bootstrap util._exit_function() File "/usr/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function p.join() File "/usr/lib/python3.10/multiprocessing/process.py", line 149, in join res = self._popen.wait(timeout) File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait return self.poll(os.WNOHANG if timeout == 0.0 else 0) File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll pid, sts = os.waitpid(self.pid, flag) File "/mnt/storage/david/Development/vllm/vllm/engine/multiprocessing/engine.py", line 374, in signal_handler raise KeyboardInterrupt("MQLLMEngine terminated") KeyboardInterrupt: MQLLMEngine terminated INFO 02-12 09:35:06 multiproc_worker_utils.py:141] Terminating local vLLM worker processes /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '
</details>

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

hjc4869 · Feb 12, 2025

Using prefix prefill with an fp8 KV cache requires hardware FP8 support (through Triton conversion operations), which doesn't exist on Navi3x. Disabling prefix prefill allows the use of the software-emulated FP8 conversions for the KV cache through the CUDA (HIP) attention kernel.
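For illustration only, here is a minimal Triton sketch of the kind of conversion involved. The kernel and names are hypothetical, not vLLM code; on gfx1100 the cast to `fp8e4nv` is expected to fail at compile time with an error like the one in the log above, while it compiles on hardware with FP8 support:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fp8e4nv_roundtrip(x_ptr, y_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    x = tl.load(x_ptr + offs)
    # Cast through the OCP e4m3 ("fp8e4nv") format. This is the step that needs
    # hardware FP8 conversion support; per the error message, gfx1100 only
    # exposes 'fp8e5'.
    y = x.to(tl.float8e4nv, fp_downcast_rounding="rtne").to(tl.float16)
    tl.store(y_ptr + offs, y)


x = torch.randn(16, device="cuda", dtype=torch.float16)
y = torch.empty_like(x)
fp8e4nv_roundtrip[(1,)](x, y, BLOCK=16)  # raises the fp8e4nv error on RDNA3
```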

gshtras · Feb 17, 2025

Thanks for the explanation. Closing this issue since it's by design.

hjc4869 · Feb 18, 2025

> Using prefix prefill with an fp8 KV cache requires hardware FP8 support (through Triton conversion operations), which doesn't exist on Navi3x. Disabling prefix prefill allows the use of the software-emulated FP8 conversions for the KV cache through the CUDA (HIP) attention kernel.

Thanks!

medubi · Apr 20, 2025

How to disable prefix prefill?

delphiRo · Sep 8, 2025

Now with vLLM v1 it's not a matter of disabling prefix prefill anymore. You can try setting VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 to fall back to the decode attention kernel, which is not written in Triton.
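A quick sketch of how that variable can be applied (assuming it must be visible before the engine and attention backend are initialized; the offline script below just reuses the flags from this issue for illustration):

```python
import os

# Set before vLLM spins up its workers so attention backend selection sees it.
os.environ["VLLM_V1_USE_PREFILL_DECODE_ATTENTION"] = "1"

from vllm import LLM

llm = LLM(model="/home/user/models/Qwen2.5-72B-Instruct-GPTQ-Int8/",
          tensor_parallel_size=2,
          enable_prefix_caching=True,
          kv_cache_dtype="fp8")
```

Exporting the variable in the shell before launching the OpenAI API server has the same effect.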

gshtras · Sep 8, 2025

It seems that exporting VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 is not a working solution in my case. I checked on an AMD Instinct MI50 with ROCm 6.3.4, and it logs that none of the fp8 formats are supported in hardware, so the fallback to the decode attention kernel you mentioned has no effect.

delphiRo · Sep 8, 2025