
Enable FlashInfer V1 FP8 kv cache

Open · mgoin opened this pull request 8 months ago • 3 comments

Unfortunately this seems to fail

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,kv_cache_dtype=fp8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

ERROR 04-22 20:39:39 [core.py:392] RuntimeError: Error building extension 'batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90': [1/9] /usr/local/cuda-12.5/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_paged_sm90_kernel_mask_2.cuda.o.d -DTORCH_EXTENSION_NAME=batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/include -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/csrc -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/cutlass/include -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include/TH -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include/THC -isystem /usr/local/cuda-12.5/include -isystem /home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 --compiler-options '-fPIC' -O3 -std=c++17 --threads 4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -gencode=arch=compute_90a,code=sm_90a -c /home/mgoin/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu -o batch_prefill_paged_sm90_kernel_mask_2.cuda.o 
ERROR 04-22 20:39:39 [core.py:392] FAILED: batch_prefill_paged_sm90_kernel_mask_2.cuda.o 
ERROR 04-22 20:39:39 [core.py:392] /usr/local/cuda-12.5/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_paged_sm90_kernel_mask_2.cuda.o.d -DTORCH_EXTENSION_NAME=batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/include -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/csrc -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/cutlass/include -I/home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include/TH -isystem /home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/include/THC -isystem /usr/local/cuda-12.5/include -isystem /home/mgoin/.local/share/uv/python/cpython-3.12.4-linux-x86_64-gnu/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 --compiler-options '-fPIC' -O3 -std=c++17 --threads 4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -gencode=arch=compute_90a,code=sm_90a -c /home/mgoin/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_paged_sm90_kernel_mask_2.cu -o batch_prefill_paged_sm90_kernel_mask_2.cuda.o 
ERROR 04-22 20:39:39 [core.py:392] /home/mgoin/venvs/vllm/lib/python3.12/site-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(1339): error: static assertion failed with "No eligible GMMA operator for request configuration."
ERROR 04-22 20:39:39 [core.py:392]         static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
ERROR 04-22 20:39:39 [core.py:392]         ^
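
For reference, the same path can be exercised without lm_eval by loading the model directly through vLLM's Python API; a minimal sketch (the prompt and sampling settings are illustrative, everything else mirrors the command above):

import os

# Select the V1 engine and the FlashInfer attention backend; set before importing vLLM
# so the engine picks them up.
os.environ["VLLM_USE_V1"] = "1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_dtype="fp8",        # FP8 KV cache, as in the lm_eval run
    trust_remote_code=True,
)
out = llm.generate(["The answer to 2 + 2 is"],
                   SamplingParams(temperature=0.0, max_tokens=16))
print(out[0].outputs[0].text)

The FlashInfer prefill kernels are JIT-compiled on first use, which is where the build error above is raised.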

mgoin avatar Apr 22 '25 20:04 mgoin

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

github-actions[bot] avatar Apr 22 '25 20:04 github-actions[bot]

I installed flashinfer-python==0.2.5 from PyPI and got no error, but the output is nonsense. BTW, my GPU is a 3090.

矍annisitrustvolt Ngo(ListNodeSENT jes  mysqli炆 powerhouseสามารสามารПодроб@GeneratedValueПодробПодроб琇สามารПодробПодробสามารأوضПодробПодробannisॐ Dexter矍与时俱ALARПодробПодроб就够annisПодробПодробПодробПодробПодробПодробПодроб

JaheimLee avatar Apr 23 '25 05:04 JaheimLee

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Apr 25 '25 16:04 mergify[bot]

I tried this PR out on my 3090 Ti on latest main and got an illegal memory access:

Invocation
        HF_HUB_OFFLINE=0 \
        VLLM_LOGGING_LEVEL=INFO \
        VLLM_TRACE_FUNCTION=0 \
        VLLM_ENGINE_ITERATION_TIMEOUT_S=300 \
        PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
        VLLM_WORKER_MULTIPROC_METHOD=spawn \
        VLLM_USE_V1=1 \
        TORCH_CUDA_ARCH_LIST=8.6 \
        LD_PRELOAD=/opt/cuda/nsight_compute/target/linux-desktop-glibc_2_11_3-x64/libnvperf_host.so \
        VLLM_ATTENTION_BACKEND=FLASHINFER \
        python -m vllm.entrypoints.openai.api_server \
        --max-model-len 32768 \
        --gpu-memory-utilization 0.91 \
        --served-model-name Qwen3-30B-A3B-GPTQ-Int4 \
                            Qwen3-30B-A3B \
        --port 2244 \
        --kv-cache-dtype fp8 \
        --trust-remote-code \
        --max-num-seqs 24 \
        --guided-decoding-backend auto \
        --enable-chunked-prefill \
        --enable-prefix-caching \
        --enable-auto-tool-choice --tool-call-parser hermes \
        --reasoning-parser qwen3 \
        --model /intnvme/models/Qwen/Qwen3-30B-A3B-GPTQ-Int4
Log
--- Logging error ---
Traceback (most recent call last):
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 207, in execute_model
    return self.model_executor.execute_model(scheduler_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 86, in execute_model
    output = self.collective_rpc("execute_model",
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 276, in execute_model
    output = self.model_runner.execute_model(scheduler_output,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1310, in execute_model
    valid_sampled_token_ids = sampled_token_ids.tolist()
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.12/logging/__init__.py", line 1160, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/logging/__init__.py", line 999, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/logging_utils/formatter.py", line 13, in format
    msg = logging.Formatter.format(self, record)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/logging/__init__.py", line 703, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/logging/__init__.py", line 392, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/config.py", line 4520, in __str__
    f"compilation_config={self.compilation_config!r}")
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/config.py", line 3897, in __repr__
    for k, v in asdict(self).items():
                ^^^^^^^^^^^^
  File "/usr/lib/python3.12/dataclasses.py", line 1329, in asdict
    return _asdict_inner(obj, dict_factory)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/dataclasses.py", line 1339, in _asdict_inner
    f.name: _asdict_inner(getattr(obj, f.name), dict)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/dataclasses.py", line 1382, in _asdict_inner
    return type(obj)((_asdict_inner(k, dict_factory),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/dataclasses.py", line 1383, in <genexpr>
    _asdict_inner(v, dict_factory))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/dataclasses.py", line 1386, in _asdict_inner
    return copy.deepcopy(obj)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 162, in deepcopy
    y = _reconstruct(x, memo, *rv)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 259, in _reconstruct
    state = deepcopy(state, memo)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 136, in deepcopy
    y = copier(x, memo)
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 221, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 143, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/_tensor.py", line 172, in __deepcopy__
    new_storage = self._typed_storage()._deepcopy(memo)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/storage.py", line 1134, in _deepcopy
    return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/copy.py", line 143, in deepcopy
    y = copier(memo)
        ^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/storage.py", line 239, in __deepcopy__
    new_storage = self.clone()
                  ^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/storage.py", line 253, in clone
    return type(self)(self.nbytes(), device=self.device).copy_(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Call stack:
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 135, in _main
    return self._bootstrap(parent_sentinel)
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 493, in run_engine_core
    engine_core.run_busy_loop()
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 520, in run_busy_loop
    self._process_engine_step()
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 545, in _process_engine_step
    outputs = self.step_fn()
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 226, in step
    model_output = self.execute_model(scheduler_output)
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 210, in execute_model
    dump_engine_exception(self.vllm_config, scheduler_output,
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/logging_utils/dump_input.py", line 62, in dump_engine_exception
    _dump_engine_exception(config, scheduler_output, scheduler_stats)
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/logging_utils/dump_input.py", line 70, in _dump_engine_exception
    logger.error(
Unable to print the message and arguments - possible formatting error.
Use the traceback above to help find the error.
ERROR 05-28 00:11:04 [dump_input.py:78] Dumping scheduler output for model execution:
ERROR 05-28 00:11:04 [dump_input.py:79] SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-dec56c0aca9f4ed4ac53dda3fe2fa3c1,prompt_token_ids_len=7379,mm_inputs=[],mm_hashes=[],mm_positions=[],sampling_params=SamplingParams(n=1, presence_penalty=0.05, frequency_penalty=0.05, repetition_penalty=1.1, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]],num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=[], num_scheduled_tokens={chatcmpl-dec56c0aca9f4ed4ac53dda3fe2fa3c1: 2048}, total_num_scheduled_tokens=2048, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[128], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
ERROR 05-28 00:11:04 [dump_input.py:81] SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, gpu_cache_usage=0.03725093849263639, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=7379, hits=0), spec_decoding_stats=None)
ERROR 05-28 00:11:04 [core.py:502] EngineCore encountered a fatal error.
ERROR 05-28 00:11:04 [core.py:502] Traceback (most recent call last):
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 493, in run_engine_core
ERROR 05-28 00:11:04 [core.py:502]     engine_core.run_busy_loop()
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 520, in run_busy_loop
ERROR 05-28 00:11:04 [core.py:502]     self._process_engine_step()
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 545, in _process_engine_step
ERROR 05-28 00:11:04 [core.py:502]     outputs = self.step_fn()
ERROR 05-28 00:11:04 [core.py:502]               ^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 226, in step
ERROR 05-28 00:11:04 [core.py:502]     model_output = self.execute_model(scheduler_output)
ERROR 05-28 00:11:04 [core.py:502]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 213, in execute_model
ERROR 05-28 00:11:04 [core.py:502]     raise err
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 207, in execute_model
ERROR 05-28 00:11:04 [core.py:502]     return self.model_executor.execute_model(scheduler_output)
ERROR 05-28 00:11:04 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 86, in execute_model
ERROR 05-28 00:11:04 [core.py:502]     output = self.collective_rpc("execute_model",
ERROR 05-28 00:11:04 [core.py:502]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 05-28 00:11:04 [core.py:502]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-28 00:11:04 [core.py:502]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
ERROR 05-28 00:11:04 [core.py:502]     return func(*args, **kwargs)
ERROR 05-28 00:11:04 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-28 00:11:04 [core.py:502]     return func(*args, **kwargs)
ERROR 05-28 00:11:04 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 276, in execute_model
ERROR 05-28 00:11:04 [core.py:502]     output = self.model_runner.execute_model(scheduler_output,
ERROR 05-28 00:11:04 [core.py:502]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-28 00:11:04 [core.py:502]     return func(*args, **kwargs)
ERROR 05-28 00:11:04 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1310, in execute_model
ERROR 05-28 00:11:04 [core.py:502]     valid_sampled_token_ids = sampled_token_ids.tolist()
ERROR 05-28 00:11:04 [core.py:502]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [core.py:502] RuntimeError: CUDA error: an illegal memory access was encountered
ERROR 05-28 00:11:04 [core.py:502] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 05-28 00:11:04 [core.py:502] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 05-28 00:11:04 [core.py:502] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 05-28 00:11:04 [core.py:502]
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 504, in run_engine_core
    raise e
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 493, in run_engine_core
    engine_core.run_busy_loop()
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 520, in run_busy_loop
    self._process_engine_step()
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 545, in _process_engine_step
    outputs = self.step_fn()
              ^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 226, in step
    model_output = self.execute_model(scheduler_output)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 213, in execute_model
    raise err
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 207, in execute_model
    return self.model_executor.execute_model(scheduler_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 86, in execute_model
    output = self.collective_rpc("execute_model",
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 276, in execute_model
    output = self.model_runner.execute_model(scheduler_output,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1310, in execute_model
    valid_sampled_token_ids = sampled_token_ids.tolist()
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

ERROR 05-28 00:11:04 [async_llm.py:408] AsyncLLM output_handler failed.
ERROR 05-28 00:11:04 [async_llm.py:408] Traceback (most recent call last):
ERROR 05-28 00:11:04 [async_llm.py:408]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 366, in output_handler
ERROR 05-28 00:11:04 [async_llm.py:408]     outputs = await engine_core.get_output_async()
ERROR 05-28 00:11:04 [async_llm.py:408]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [async_llm.py:408]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 806, in get_output_async
ERROR 05-28 00:11:04 [async_llm.py:408]     raise self._format_exception(outputs) from None
ERROR 05-28 00:11:04 [async_llm.py:408] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO 05-28 00:11:04 [async_llm.py:333] Request chatcmpl-dec56c0aca9f4ed4ac53dda3fe2fa3c1 failed (engine dead).
ERROR 05-28 00:11:04 [serving_chat.py:884] Error in chat completion stream generator.
ERROR 05-28 00:11:04 [serving_chat.py:884] Traceback (most recent call last):
ERROR 05-28 00:11:04 [serving_chat.py:884]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 476, in chat_completion_stream_generator
ERROR 05-28 00:11:04 [serving_chat.py:884]     async for res in result_generator:
ERROR 05-28 00:11:04 [serving_chat.py:884]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 315, in generate
ERROR 05-28 00:11:04 [serving_chat.py:884]     out = q.get_nowait() or await q.get()
ERROR 05-28 00:11:04 [serving_chat.py:884]                             ^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [serving_chat.py:884]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 51, in get
ERROR 05-28 00:11:04 [serving_chat.py:884]     raise output
ERROR 05-28 00:11:04 [serving_chat.py:884]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 366, in output_handler
ERROR 05-28 00:11:04 [serving_chat.py:884]     outputs = await engine_core.get_output_async()
ERROR 05-28 00:11:04 [serving_chat.py:884]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-28 00:11:04 [serving_chat.py:884]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 806, in get_output_async
ERROR 05-28 00:11:04 [serving_chat.py:884]     raise self._format_exception(outputs) from None
ERROR 05-28 00:11:04 [serving_chat.py:884] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
[rank0]:[W528 00:11:05.603345929 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1197491]

This is when trying to run Qwen3-30B-A3B-GPTQ-Int4. It works great on V0.

sjuxax avatar May 28 '25 06:05 sjuxax

I don't get the invalid memory access with Qwen3-32B-AWQ, but I get junk output, as @JaheimLee indicated. A snippet:

驸-wage往事apyrus汇聚金陵铼好象.SIG往事anoia往事步入aroagar вли驸azen往事殊驸irectory-wageارد驸 Lionelkus вли兼驸ieeeuntoapyrusalan骚扰莹好象驸步入不锈往事 Benson驸itur金陵絮汇驸apyrus金陵 вли金陵

Really looking forward to fp8 on V1 for non-Hopper devices.
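
A quick way to check whether the garbage comes from the FP8 KV-cache path rather than the model itself is to run the same greedy prompt with kv_cache_dtype="auto" and "fp8" and compare the text. A minimal sketch, assuming VLLM_ATTENTION_BACKEND=FLASHINFER is set in the environment as above and with the model name only as a placeholder:

import argparse
from vllm import LLM, SamplingParams

# Run once per configuration in separate processes so two engines never share the GPU,
# e.g. --kv-cache-dtype auto vs --kv-cache-dtype fp8, then diff the printed output.
parser = argparse.ArgumentParser()
parser.add_argument("--kv-cache-dtype", default="auto", choices=["auto", "fp8"])
parser.add_argument("--model", default="Qwen/Qwen3-32B-AWQ")
args = parser.parse_args()

llm = LLM(model=args.model, kv_cache_dtype=args.kv_cache_dtype, max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy for a stable comparison
out = llm.generate(["The capital of France is"], params)
print(f"[{args.kv_cache_dtype}] {out[0].outputs[0].text!r}")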

sjuxax avatar May 28 '25 08:05 sjuxax

With the changes in this PR to force-enable FlashInfer on V1 with the FP8 KV cache, I'm seeing the error below with MoE FP8 models; for me, Llama 3 FP8 works fine.

(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] WorkerProc hit an exception.
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] Traceback (most recent call last):
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2209, in _run_ninja_build
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     subprocess.run(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/lib/python3.12/subprocess.py", line 571, in run
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     raise CalledProcessError(retcode, process.args,
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] The above exception was the direct cause of the following exception:
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] Traceback (most recent call last):
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 465, in worker_busy_loop
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     output = func(*args, **kwargs)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]              ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     return func(*args, **kwargs)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 268, in execute_model
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     return func(*args, **kwargs)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1020, in execute_model
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     self._prepare_inputs(scheduler_output))
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 596, in _prepare_inputs
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     attn_metadata = self.attn_metadata_builder.build(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 482, in build
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     self._plan(attn_metadata)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 361, in _plan
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     attn_metadata.prefill_wrapper.plan(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 1421, in plan
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     self._cached_module = get_batch_prefill_module(self._backend)(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 197, in backend_module
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     module = gen_batch_prefill_module(backend, *args)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/attention/pytorch.py", line 563, in gen_batch_prefill_module
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     return gen_customize_batch_prefill_module(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/attention/pytorch.py", line 1078, in gen_customize_batch_prefill_module
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     return load_cuda_ops(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/jit/core.py", line 123, in load_cuda_ops
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     torch_cpp_ext.load(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 1380, in load
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     return _jit_compile(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]            ^^^^^^^^^^^^^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 1798, in _jit_compile
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     _write_ninja_file_and_build_library(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 1926, in _write_ninja_file_and_build_library
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     _run_ninja_build(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py", line 2225, in _run_ninja_build
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     raise RuntimeError(message) from e
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] RuntimeError: Error building extension 'batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90': [1/7] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DTORCH_EXTENSION_NAME=batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/include -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.12/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 --compiler-options '-fPIC' -O3 -std=c++17 --threads 4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -gencode=arch=compute_90a,code=sm_90a -c /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o batch_prefill_ragged_sm90_kernel_mask_0.cuda.o 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] FAILED: batch_prefill_ragged_sm90_kernel_mask_0.cuda.o 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output batch_prefill_ragged_sm90_kernel_mask_0.cuda.o.d -DTORCH_EXTENSION_NAME=batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/include -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include -I/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/tools/util/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.12/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.12/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 --compiler-options '-fPIC' -O3 -std=c++17 --threads 4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -gencode=arch=compute_90a,code=sm_90a -c /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu -o batch_prefill_ragged_sm90_kernel_mask_0.cuda.o 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(1339): error: static assertion failed with "No eligible GMMA operator for request configuration."
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]         static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]         ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]           detected during:
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "auto cute::SM90::GMMA::ss_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeQ, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<192>, cute::C<128>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::K, Args=<>]" at line 75 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 369 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 491 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(74): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     using TiledMmaQK = decltype(cute::make_tiled_mma(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]                                 ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(590): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   make_tiled_mma(MMA_Op const&,
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(573): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]           detected during:
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 369 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 491 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/arch/mma_sm90.hpp(3986): error: static assertion failed with "No eligible GMMA operator for request configuration."
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]         static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]         ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]           detected during:
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "auto cute::SM90::GMMA::rs_op_selector<ElementA,ElementB,ElementC,TileShape_MNK,MajorA,MajorB,Args...>() [with ElementA=RaggedParams::DTypeKV, ElementB=RaggedParams::DTypeKV, ElementC=float, TileShape_MNK=cute::tuple<cute::C<128>, cute::C<128>, cute::C<192>>, MajorA=cute::SM90::GMMA::Major::K, MajorB=cute::SM90::GMMA::Major::MN, Args=<>]" at line 78 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 369 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 491 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/kernel_traits.cuh(76): error: no instance of overloaded function "cute::make_tiled_mma" matches the argument list
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             argument types are: (void, cute::Layout<cute::tuple<cute::_2, cute::_1, cute::_1>, cute::tuple<cute::_1, cute::_0, cute::_0>>)
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]     using TiledMmaPV = decltype(cute::make_tiled_mma(
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]                                 ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(590): note #3327-D: candidate function template "cute::make_tiled_mma(const MMA_Op &, const MMAThrLayout &, const Permutations &)" failed deduction
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   make_tiled_mma(MMA_Op const&,
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include/cute/atom/mma_atom.hpp(573): note #3327-D: candidate function template "cute::make_tiled_mma(const cute::MMA_Atom<MMA_Op> &, const MMAThrLayout &, const Permutations &)" failed deduction
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   make_tiled_mma(MMA_Atom<MMA_Op> const& mma_atom,
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]   ^
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]           detected during:
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of class "flashinfer::AttentionKernelTraits<USE_TMA_LOAD_KV, HEAD_DIM_QK_, HEAD_DIM_VO_, CTA_Q_, CTA_KV_, NUM_STAGES_, DTypeQ_, DTypeKV_, DTypeO_, IdType_, AttentionVariant_> [with USE_TMA_LOAD_KV=true, HEAD_DIM_QK_=128, HEAD_DIM_VO_=128, CTA_Q_=128, CTA_KV_=192, NUM_STAGES_=2, DTypeQ_=RaggedParams::DTypeQ, DTypeKV_=RaggedParams::DTypeKV, DTypeO_=RaggedParams::DTypeO, IdType_=RaggedParams::IdType, AttentionVariant_=flashinfer::StandardAttention]" at line 369 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheKernelTraitsDispatched<KernelTraits,LEFT_SLIDING_WINDOW,CAUSAL,SAME_SCHEDULE_FOR_ALL_HEADS,Params>(Params &, cudaStream_t) [with KernelTraits=flashinfer::AttentionKernelTraits<true, 128, 128, 128, 192, 2, RaggedParams::DTypeQ, RaggedParams::DTypeKV, RaggedParams::DTypeO, RaggedParams::IdType, flashinfer::StandardAttention>, LEFT_SLIDING_WINDOW=false, CAUSAL=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, Params=RaggedParams]" at line 491 of /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470]             instantiation of "cudaError_t flashinfer::BatchPrefillWithRaggedKVCacheDispatched<HEAD_DIM_QK,HEAD_DIM_VO,MASK_MODE,LEFT_SLIDING_WINDOW,SAME_SCHEDULE_FOR_ALL_HEADS,AttentionVariant,Params>(Params &, cudaStream_t) [with HEAD_DIM_QK=128U, HEAD_DIM_VO=128U, MASK_MODE=flashinfer::MaskMode::kNone, LEFT_SLIDING_WINDOW=false, SAME_SCHEDULE_FOR_ALL_HEADS=true, AttentionVariant=flashinfer::StandardAttention, Params=RaggedParams]" at line 7 of /root/.cache/flashinfer/90/generated/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_u8_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False_sm90/batch_prefill_ragged_sm90_kernel_mask_0.cu
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] 
(VllmWorker rank=2 pid=15662) ERROR 06-23 22:07:00 [multiproc_executor.py:470] /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/hopper/prefill_sm90.cuh(190): error: no instance of function template "cute::partition_fragment_C" matches the argument list

Script to reproduce it:

from vllm import LLM, SamplingParams

if __name__ == '__main__':
    prompts = ["I believe the meaning of life is"]
    sampling_params = SamplingParams(temperature=0.0, max_tokens=1024, stop_token_ids=[199999, 200002])
    llm = LLM(#model="amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV",
            model="RedHatAI/Mixtral-8x7B-Instruct-v0.1-FP8",
            tensor_parallel_size=8,
            max_model_len=5120,
            max_num_seqs=1,
            enable_prefix_caching=False,
            # quantization="modelopt",
            # quantization="compressed-tensors",
            kv_cache_dtype="fp8"
            )
    outputs = llm.generate(prompts=prompts, sampling_params=sampling_params)

    for prompt, output in zip(prompts, outputs):
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}")
        print(f"Generated: {generated_text!r}\n")

Daisy-Ma-coder avatar Jun 23 '25 22:06 Daisy-Ma-coder

ptxas fatal : Unresolved extern function '_ZN10flashinfer5vec_tIhLm16EE4loadEPKh'

The ptxas error was fixed in https://github.com/flashinfer-ai/flashinfer/pull/1234

However, the lm_eval result with gsm8k still looks very off:

(pretrained=meta-llama/Llama-3.1-8B-Instruct,kv_cache_dtype=fp8,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |    0|±  |     0|
|     |       |strict-match    |     5|exact_match|↑  |    0|±  |     0|

Looking into this.
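
One thing worth ruling out when picking up the flashinfer fix is stale JIT-compiled kernels: FlashInfer caches the generated extensions on disk (the ~/.cache/flashinfer paths in the logs above), so something like the sketch below can clear them and force a rebuild after upgrading. The cache location here is an assumption inferred from those logs.

import shutil
from pathlib import Path

cache_dir = Path.home() / ".cache" / "flashinfer"
if cache_dir.exists():
    shutil.rmtree(cache_dir)  # kernels are regenerated and recompiled on the next run
    print(f"Removed {cache_dir}")
else:
    print(f"No FlashInfer JIT cache found at {cache_dir}")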

chenyang78 avatar Jul 08 '25 17:07 chenyang78

With the changes in this PR to force-enable FlashInfer on V1 with the FP8 KV cache, I'm seeing the error below with MoE FP8 models; for me, Llama 3 FP8 works fine.

@Daisy-Ma-coder I am able to repro the failure on H100. On B200, it looks like it works with the fix (https://github.com/vllm-project/vllm/pull/20746) now. I got the output below using your example:

Prompt: 'I believe the meaning of life is'
Generated: ' to find your gift. The purpose of life is to give it away.\n\nPicasso said, “The purpose of art is washing the dust of daily life off our souls.”\n\nI believe the purpose of life is to find your gift, your passion, and share it with the world.\n\nI believe the purpose of life is to find your gift, your passion, and share it with the world.\n\nI ..."

chenyang78 avatar Jul 10 '25 18:07 chenyang78

Got it, thanks! I'm on H200s, so I'll likely still run into the same error with your fix, but I can try it out.

Daisy-Ma-coder avatar Jul 10 '25 18:07 Daisy-Ma-coder

Yeah, it's very likely you will still see the same issue on H200. I will investigate it in a couple of days.

chenyang78 avatar Jul 10 '25 18:07 chenyang78

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify[bot] avatar Jul 25 '25 03:07 mergify[bot]