vllm [Feature] Support sequence parallelism for static fp8 quantization

This PR adds support for sequence parallelism with static fp8 quantization. Enabling it requires the following config:

from vllm import LLM
from vllm.config import CompilationConfig

config = CompilationConfig(level=3,
                           splitting_ops=[],
                           compile_sizes=[4],
                           custom_ops=["+rms_norm"])

# enable_noop must be True for the sequence parallelism patterns to match correctly
config.pass_config.enable_noop = True
config.pass_config.enable_sequence_parallelism = True

llm = LLM(
    model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
    enforce_eager=False,
    tensor_parallel_size=2,
    compilation_config=config)

Jun 05 '25 04:06 cascade812

[!WARNING] You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Jun 05 '25 04:06 gemini-code-assist[bot]

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Jun 05 '25 04:06 github-actions[bot]

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @cascade812.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Jun 17 '25 05:06 mergify[bot]

A few asks:

I agree with @tlrmchlsmth that including fused ops might be unnecessary - could we just make this pass run before fusion, and then make sure fusion still works?

Is there any way we could make this pass more general and not reliant on the exact ops? That way it could also work if custom ops are disabled.

Perhaps here we could enable the custom ops and then lower them after the passes run, like you described in an offline conversation.

Could you post performance numbers? And should we do this for any other ops as well?

Right, it works once I move the sequence parallelism pass to run before the fusion pass.
We can define a custom op that serves as a placeholder, pattern-match on that custom op, and lower it after the pass runs.
The SP pass doesn't provide a perf gain by itself; it lays the groundwork for fusing matmul and collective ops (e.g. AsyncTP), which can provide a good perf gain. I can provide perf numbers after I add AsyncTP for scaled mm + collective op fusion, which I will do in a separate PR.
We also need similar work for dynamic fp8 ops, which require different pattern matching.
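
To illustrate why the pass order matters, here is a toy sketch. It is not the actual vLLM pass: the graph is a plain list of op names standing in for FX nodes, and every op name and helper below is hypothetical. It also assumes the SP rewrite places the all-gather after the quant op so that the norm and quant stay adjacent for the later fusion pass.

```python
# Toy model of compiler pass ordering. A "graph" is a flat list of op names.
# All op names and helper functions are illustrative, not vLLM APIs.

def rewrite(ops, pattern, replacement):
    """Replace every occurrence of `pattern` (a sub-list) with `replacement`."""
    out, i, n = [], 0, len(pattern)
    while i < len(ops):
        if ops[i:i + n] == pattern:
            out.extend(replacement)
            i += n
        else:
            out.append(ops[i])
            i += 1
    return out


def sequence_parallel_pass(ops):
    # Assumption: all_reduce -> norm -> quant becomes
    # reduce_scatter -> norm -> quant -> all_gather.
    return rewrite(
        ops,
        ["all_reduce", "rms_norm", "static_fp8_quant"],
        ["reduce_scatter", "rms_norm", "static_fp8_quant", "all_gather"],
    )


def fusion_pass(ops):
    # Fuse the norm and the static fp8 quant into a single op.
    return rewrite(
        ops,
        ["rms_norm", "static_fp8_quant"],
        ["fused_rms_norm_static_fp8_quant"],
    )


graph = ["all_reduce", "rms_norm", "static_fp8_quant", "scaled_mm"]

# SP first: the unfused pattern matches, and fusion still applies afterwards.
sp_then_fuse = fusion_pass(sequence_parallel_pass(graph))

# Fusion first: the SP pattern no longer matches unless it also knows
# about the fused op, which is what the review asked to avoid.
fuse_then_sp = sequence_parallel_pass(fusion_pass(graph))
```

In this toy model, running the SP pass first means it only needs patterns for the unfused ops, while fusion runs unchanged on its output.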

Jun 17 '25 05:06 cascade812
