dynamic dispatch of fp8 kernels
This mostly affects ROCm, where hardware may support one or the other FP8 type. All fp8 kernels are now templated on fp8_type instead of assuming a single type via a using FP8_TYPE = ... alias. CUDA is largely unaffected: its fp8 kernels only instantiate the OCP type, so there is no binary bloat. For ROCm, two kernel templates are instantiated, one for each fp8 type.
- FP8_TYPE is removed.
- All fp8 kernels take an additional fp8_type template parameter.
- The FP8_E4M3_MAX constant is replaced with the templated struct helper fp8_e4m3_adjusted_max_v.
- Add is_fp8_ocp(), a C++ function for host-side runtime queries of the preferred fp8 type.
- fp8 kernel launches on ROCm get both OCP and FNUZ templated variants.
- Add the Python APIs current_platform.supports_fp8(), is_fp8_fnuz(), and fp8_dtype().
- Add VLLM_DISPATCH_FP8_TYPES, which can nest with the other dispatch macros (see the sketch below).
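For illustration, here is a minimal sketch of the dispatch pattern this enables, in the spirit of VLLM_DISPATCH_FP8_TYPES. The tag types ocp_t/fnuz_t, the stubbed is_fp8_ocp(), the launcher, and the macro name DISPATCH_FP8_TYPES are simplified placeholders, not vLLM's actual definitions; the real macro dispatches between c10::Float8_e4m3fn (OCP) and c10::Float8_e4m3fnuz (FNUZ).

#include <cstdio>
#include <type_traits>

// Placeholder tag types standing in for c10::Float8_e4m3fn (OCP)
// and c10::Float8_e4m3fnuz (FNUZ).
struct ocp_t {};
struct fnuz_t {};

// Host-side runtime query of the preferred fp8 type; stubbed here.
// In vLLM this inspects the platform (e.g. the ROCm architecture).
bool is_fp8_ocp() { return true; }

// Stand-in for a kernel launcher templated on fp8_type instead of a
// fixed `using FP8_TYPE = ...` alias.
template <typename fp8_type>
void launch_scaled_quant() {
  std::printf("launching %s variant\n",
              std::is_same_v<fp8_type, ocp_t> ? "OCP" : "FNUZ");
}

// Dispatch macro sketch: picks the template instantiation at runtime,
// so a ROCm binary can carry both variants while a CUDA build only
// ever reaches the OCP branch. Each macro introduces its own local
// type alias for its body, which is what lets dispatch macros nest.
#define DISPATCH_FP8_TYPES(BODY)  \
  do {                            \
    if (is_fp8_ocp()) {           \
      using fp8_t = ocp_t;        \
      BODY();                     \
    } else {                      \
      using fp8_t = fnuz_t;       \
      BODY();                     \
    }                             \
  } while (0)

int main() {
  DISPATCH_FP8_TYPES([&] { launch_scaled_quant<fp8_t>(); });
  return 0;
}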
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jeffdaily.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@ProExpertProg I really appreciate your thoughtful review. It greatly improved the quality of this change.
@ProExpertProg Getting some cuda build failures:
error: identifier "fp8_e4m3_adjusted_max_v< ::c10::Float8_e4m3fn> " is undefined in device code
Looks like fp8_e4m3_adjusted_max_v is the problem, at least for nvcc. Any suggestions?
Maybe add C10_HOST_DEVICE back in, like the original constant?
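For reference, here is a rough sketch of the shape that fix could take, assuming the helper is backed by a per-type max struct and that 448 is the OCP e4m3 maximum; the actual declarations in vLLM's headers may differ.

#include <c10/macros/Macros.h>       // C10_HOST_DEVICE -> __host__ __device__
#include <c10/util/Float8_e4m3fn.h>  // c10::Float8_e4m3fn, the OCP e4m3 type

template <typename T>
struct fp8_e4m3_adjusted_max;

template <>
struct fp8_e4m3_adjusted_max<c10::Float8_e4m3fn> {
  // Marking the accessor C10_HOST_DEVICE, as the original constant was,
  // makes it usable from device code; without a device annotation on
  // this path, nvcc reports the helper as "undefined in device code".
  static C10_HOST_DEVICE constexpr float val() { return 448.0f; }
};

Whether the annotation lands on an accessor like this or on the fp8_e4m3_adjusted_max_v variable template itself is a detail of the final fix; the point is that the original FP8_E4M3_MAX constant carried C10_HOST_DEVICE, and its templated replacement needs the same device-side visibility.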
@jeffdaily it looks like you missed an import in vllm.model_executor.layers.quantization.utils.fp8_utils:
ImportError: cannot import name 'current_platform_fp8_dtype' from 'vllm.model_executor.layers.quantization.utils.fp8_utils' (/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py)
Could you get that fixed, and we can enable the full CI?
@ProExpertProg fixed.
I had aggressively removed that variable from vllm/model_executor/layers/quantization/utils/fp8_utils.py because it wasn't used elsewhere in that file. I failed to grep through all the other files for anything that might still import it.
Yeah, that was a nice cleanup! @robertgshaw2-redhat @mgoin could you guys help with the fp8 support on other platforms?
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jeffdaily.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@jeffdaily I am planning to look into it more, but do you know if this is a ROCm version issue? This PR is breaking the build on main:
/mnt/nvme3n1p1/sage/git/nm-vllm/csrc/quantization/fp8/amd/quant_utils.cuh:25:33: error: use of undeclared identifier '__hip_fp8_e4m3'
25 | __hip_cvt_float_to_fp8(r, __hip_fp8_e4m3::__default_saturation,
| ^
/mnt/nvme3n1p1/sage/git/nm-vllm/csrc/quantization/fp8/amd/quant_utils.cuh:26:30: error: use of undeclared identifier '__hip_fp8_e4m3'
26 | __hip_fp8_e4m3::__default_interpret),
| ^
Looks like a ROCm version issue. __hip_fp8_e4m3 and related types were added in ROCm 6.3 and are not present in ROCm 6.2. I was testing in a ROCm 6.3 environment and assumed the vLLM CI was also on the latest ROCm version.
Could you help with the fix? I assume there's an alternative for ROCm 6.2? It's not the CI; it's breaking a local build.
Working on a fix now. Can we forward-fix with a new PR or do you need to revert this one?
I think forward fixing is fine, the revert wouldn't be immediate either anyway. Unless you think the fix will be complicated?
The fix shouldn't be complicated. We'll need to guard the use of __hip_fp8_e4m3 behind a ROCM_VERSION check limiting it to ROCm 6.3 or newer, roughly as sketched below.
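The intrinsic call in this sketch is the one from the error output above; ROCM_VERSION is assumed to encode major*10000 + minor*100 (so ROCm 6.3 is 60300), and float_to_fp8_ocp and fp8_e4m3_legacy_convert are hypothetical names standing in for the real conversion helpers in quant_utils.cuh.

#include <hip/hip_fp8.h>  // fp8 storage type; the OCP __hip_fp8_e4m3 is 6.3+

// Hypothetical stand-in for the conversion path used before ROCm 6.3.
__device__ __hip_fp8_storage_t fp8_e4m3_legacy_convert(float r);

static __device__ __hip_fp8_storage_t float_to_fp8_ocp(float r) {
#if defined(ROCM_VERSION) && ROCM_VERSION >= 60300
  // __hip_fp8_e4m3 and this intrinsic overload only exist from ROCm 6.3.
  return __hip_cvt_float_to_fp8(r, __hip_fp8_e4m3::__default_saturation,
                                __hip_fp8_e4m3::__default_interpret);
#else
  // ROCm 6.2 and older: fall back to the pre-intrinsic conversion.
  return fp8_e4m3_legacy_convert(r);
#endif
}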
Okay ping me on vLLM Slack once done so we can merge ASAP
@ProExpertProg I'm not on Slack yet. Here's the PR with the fix:
https://github.com/vllm-project/vllm/pull/14709