dynamic dispatch of fp8 kernels
This mostly affects ROCm, where hardware may support one or the other FP8 type. All fp8 kernels are now templated on fp8_type instead of assuming a single type via a using FP8_TYPE = ... alias. CUDA is largely unaffected: its fp8 kernels only instantiate the OCP type, so there is no binary bloat. For ROCm, two kernel templates are instantiated, one for each fp8 type.
- FP8_TYPE is removed.
- All fp8 kernels take an additional fp8_type template parameter.
- The FP8_E4M3_MAX constant is replaced with the templated struct helper fp8_e4m3_adjusted_max_v.
- Add is_fp8_ocp(), a C++ function for host-side runtime queries of the preferred fp8 type.
- fp8 kernel launches on ROCm get both OCP and FNUZ templated variants.
- Add the Python APIs current_platform.supports_fp8(), is_fp8_fnuz(), and fp8_dtype().
- Add VLLM_DISPATCH_FP8_TYPES, which can nest with the other dispatch macros (see the sketch below).
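For illustration, here is a minimal sketch of the dispatch pattern this enables, in the spirit of VLLM_DISPATCH_FP8_TYPES. The tag types ocp_t/fnuz_t, the stubbed is_fp8_ocp(), the launcher, and the macro name DISPATCH_FP8_TYPES are simplified placeholders, not vLLM's actual definitions; the real macro dispatches between c10::Float8_e4m3fn (OCP) and c10::Float8_e4m3fnuz (FNUZ).

#include <cstdio>
#include <type_traits>

// Placeholder tag types standing in for c10::Float8_e4m3fn (OCP)
// and c10::Float8_e4m3fnuz (FNUZ).
struct ocp_t {};
struct fnuz_t {};

// Host-side runtime query of the preferred fp8 type; stubbed here.
// In vLLM this inspects the platform (e.g. the ROCm architecture).
bool is_fp8_ocp() { return true; }

// Stand-in for a kernel launcher templated on fp8_type instead of a
// fixed `using FP8_TYPE = ...` alias.
template <typename fp8_type>
void launch_scaled_quant() {
  std::printf("launching %s variant\n",
              std::is_same_v<fp8_type, ocp_t> ? "OCP" : "FNUZ");
}

// Dispatch macro sketch: picks the template instantiation at runtime,
// so a ROCm binary can carry both variants while a CUDA build only
// ever reaches the OCP branch. Each macro introduces its own local
// type alias for its body, which is what lets dispatch macros nest.
#define DISPATCH_FP8_TYPES(BODY)  \
  do {                            \
    if (is_fp8_ocp()) {           \
      using fp8_t = ocp_t;        \
      BODY();                     \
    } else {                      \
      using fp8_t = fnuz_t;       \
      BODY();                     \
    }                             \
  } while (0)

int main() {
  DISPATCH_FP8_TYPES([&] { launch_scaled_quant<fp8_t>(); });
  return 0;
}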
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jeffdaily.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@ProExpertProg I really appreciate your thoughtful review. It greatly improved the quality of this change.
@ProExpertProg Getting some cuda build failures:
error: identifier "fp8_e4m3_adjusted_max_v< ::c10::Float8_e4m3fn> " is undefined in device code
Looks like fp8_e4m3_adjusted_max_v is the problem, at least for nvcc. Any suggestions?
Maybe add C10_HOST_DEVICE back in, like the original constant?
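For reference, here is a rough sketch of the shape that fix could take, assuming the helper is backed by a per-type max struct and that 448 is the OCP e4m3 maximum; the actual declarations in vLLM's headers may differ.

#include <c10/macros/Macros.h>       // C10_HOST_DEVICE -> __host__ __device__
#include <c10/util/Float8_e4m3fn.h>  // c10::Float8_e4m3fn, the OCP e4m3 type

template <typename T>
struct fp8_e4m3_adjusted_max;

template <>
struct fp8_e4m3_adjusted_max<c10::Float8_e4m3fn> {
  // Marking the accessor C10_HOST_DEVICE, as the original constant was,
  // makes it usable from device code; without a device annotation on
  // this path, nvcc reports the helper as "undefined in device code".
  static C10_HOST_DEVICE constexpr float val() { return 448.0f; }
};

Whether the annotation lands on an accessor like this or on the fp8_e4m3_adjusted_max_v variable template itself is a detail of the final fix; the point is that the original FP8_E4M3_MAX constant carried C10_HOST_DEVICE, and its templated replacement needs the same device-side visibility.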
@jeffdaily it looks like you missed an import in vllm.model_executor.layers.quantization.utils.fp8_utils:
ImportError: cannot import name 'current_platform_fp8_dtype' from 'vllm.model_executor.layers.quantization.utils.fp8_utils' (/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py)
Could you get that fixed, and we can enable the full CI?
@ProExpertProg fixed.
I had aggressively removed that variable from vllm/model_executor/layers/quantization/utils/fp8_utils.py because it wasn't used elsewhere in that file. I failed to grep through all the other files for anything that might still import it.
Yeah, that was a nice cleanup! @robertgshaw2-redhat @mgoin could you guys help with the fp8 support on other platforms?
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @jeffdaily.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
@jeffdaily I am planning to look into it more, but do you know if this is a ROCm version issue? This PR is breaking the build on main:
/mnt/nvme3n1p1/sage/git/nm-vllm/csrc/quantization/fp8/amd/quant_utils.cuh:25:33: error: use of undeclared identifier '__hip_fp8_e4m3'
25 | __hip_cvt_float_to_fp8(r, __hip_fp8_e4m3::__default_saturation,
| ^
/mnt/nvme3n1p1/sage/git/nm-vllm/csrc/quantization/fp8/amd/quant_utils.cuh:26:30: error: use of undeclared identifier '__hip_fp8_e4m3'
26 | __hip_fp8_e4m3::__default_interpret),
| ^
Looks like a ROCm version issue. __hip_fp8_e4m3 and related types were added in ROCm 6.3 and are not present in ROCm 6.2. I was testing in a ROCm 6.3 environment and assumed the vLLM CI was also on the latest ROCm version.
Could you help with the fix? I assume there's an alternative for ROCm 6.2? It's not the CI; it's breaking a local build.
Working on a fix now. Can we forward-fix with a new PR or do you need to revert this one?
I think forward fixing is fine, the revert wouldn't be immediate either anyway. Unless you think the fix will be complicated?
The fix shouldn't be complicated. We'll need to guard the use of __hip_fp8_e4m3 behind a ROCM_VERSION check limiting it to ROCm 6.3 or newer, roughly as sketched below.
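The intrinsic call in this sketch is the one from the error output above; ROCM_VERSION is assumed to encode major*10000 + minor*100 (so ROCm 6.3 is 60300), and float_to_fp8_ocp and fp8_e4m3_legacy_convert are hypothetical names standing in for the real conversion helpers in quant_utils.cuh.

#include <hip/hip_fp8.h>  // fp8 storage type; the OCP __hip_fp8_e4m3 is 6.3+

// Hypothetical stand-in for the conversion path used before ROCm 6.3.
__device__ __hip_fp8_storage_t fp8_e4m3_legacy_convert(float r);

static __device__ __hip_fp8_storage_t float_to_fp8_ocp(float r) {
#if defined(ROCM_VERSION) && ROCM_VERSION >= 60300
  // __hip_fp8_e4m3 and this intrinsic overload only exist from ROCm 6.3.
  return __hip_cvt_float_to_fp8(r, __hip_fp8_e4m3::__default_saturation,
                                __hip_fp8_e4m3::__default_interpret);
#else
  // ROCm 6.2 and older: fall back to the pre-intrinsic conversion.
  return fp8_e4m3_legacy_convert(r);
#endif
}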
Okay ping me on vLLM Slack once done so we can merge ASAP
@ProExpertProg I'm not on Slack yet. Here's the PR with the fix:
https://github.com/vllm-project/vllm/pull/14709