
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm

Open charlifu opened this issue 1 year ago • 16 comments

This PR adds fp8 linear layer support on ROCm.

  • Use torch.float8_e4m3fnuz as the fp8 data type on ROCm, and update the fp8 conversion kernels accordingly.
  • Convert the weights from torch.float8_e4m3fn to torch.float8_e4m3fnuz after weight loading, and adjust the scaling factor accordingly.
  • Since ROCm uses torch 2.5, in which _scaled_mm returns a single value, add a condition check when returning the result from _scaled_mm.
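A minimal sketch of what the weight/scale adjustment in the second bullet amounts to (the helper name here is illustrative, not necessarily what this PR uses; it relies on the fact that e4m3fnuz uses exponent bias 8 versus 7 for e4m3fn, so the same bit pattern encodes half the value):

```python
import torch

def convert_fn_weight_to_fnuz(weight: torch.Tensor, weight_scale: torch.Tensor):
    """Reinterpret e4m3fn weights as e4m3fnuz and compensate via the scale."""
    assert weight.dtype == torch.float8_e4m3fn
    bits = weight.view(torch.int8)
    # Bit pattern 0x80 (-128) is -0.0 in e4m3fn but NaN in e4m3fnuz; map it to +0.0.
    bits = torch.where(bits == -128, torch.zeros_like(bits), bits)
    weight_fnuz = bits.view(torch.float8_e4m3fnuz)
    # The same bits now decode to half the value (bias 8 vs 7), so double the scale.
    return weight_fnuz, weight_scale * 2.0

# Third bullet: on torch 2.5, torch._scaled_mm returns a single tensor, while older
# versions returned a tuple, hence a check along the lines of:
#   out = torch._scaled_mm(...)
#   output = out[0] if isinstance(out, tuple) else out
```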

Tested on the llama-2-70b-chat-hf-fp8 model generated with NeuralMagic/AutoFP8.

Prompt: 'The president of the United States is', Generated text: " the head of the executive branch and the highest-ranking official in the federal government. The president is elected by the people through the Electoral College and serves a four-year term. The president's primary responsibilities include:\n\n1. Serving as the commander-in-chief of the armed forces\n2. Nominating and, with the advice and consent of the Senate, appointing federal judges, including Supreme Court justices\n3. Signing or vetoing bills passed by Congress\n4. Conducting foreign policy and negotiating treaties on behalf of the United States\n5. Appointing ambassadors and other high-ranking officials\n6. Making executive orders, which have the force of law\n7. Addressing the nation and Congress on important issues\n8. Leading and coordinating the response to national emergencies and natural disasters\n\nThe president also has the power to grant pardons and reprieves, except in cases of impeachment.\n\nThe vice president of the United States is the second-highest-ranking official in the federal government and serves as the president's deputy. The vice president is also elected through the Electoral College and serves a four-year term. The vice president's primary responsibilities include:\n\n1. Assisting the president in their duties\n2. Presiding over the Senate, casting tie-breaking votes in the case of a deadlock\n3. Assuming the presidency if the president is unable to serve, either through death, resignation, or removal from office\n4. Serving as a member of the National Security Council\n5. Representing the United States at official events and ceremonies\n6. Participating in Cabinet meetings and offering advice to the president on policy matters\n\nThe vice president also has the power to succeed the president if the office becomes vacant, either through death, resignation, or removal from office.\n\nThe Cabinet is a group of high-ranking officials who are appointed by the president and confirmed by the Senate. The Cabinet members serve as the heads of the 15 executive departments, which are responsible for carrying out the day-to-day operations of the federal government. The Cabinet members also advise the president on policy matters and help to implement the president's agenda.\n\nThe Cabinet members include:\n\n1."

charlifu avatar Aug 06 '24 16:08 charlifu

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs will not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small but essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

github-actions[bot] avatar Aug 06 '24 16:08 github-actions[bot]

/ready

charlifu avatar Aug 06 '24 18:08 charlifu

FYI you may want to look at https://github.com/vllm-project/vllm/pull/7233 to review the usage of hipcub for reduction

mgoin avatar Aug 07 '24 16:08 mgoin

@HaiShaw - are there any limitations for ops.scaled_fp8_quant on hip? Can we cover all cases?

  • static
  • dynamic per tensor
  • dynamic per token

It should cover all cases. Will test; currently only static is tested.
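For reference, the three cases amount to roughly the following (a plain-torch illustration of the quantization math, not the actual ops.scaled_fp8_quant signature):

```python
import torch

x = torch.randn(4, 8, dtype=torch.float16)        # fake activations: tokens x hidden
fp8_max = torch.finfo(torch.float8_e4m3fnuz).max  # ROCm-native fp8 dtype

# Static: the scale comes from the checkpoint / offline calibration and is fixed.
static_scale = torch.tensor(0.05)

# Dynamic per tensor: one scale derived from the whole activation tensor.
per_tensor_scale = x.abs().max().float() / fp8_max

# Dynamic per token: one scale per row (token) of the activation.
per_token_scale = x.abs().amax(dim=-1, keepdim=True).float() / fp8_max

# In every case the quantized tensor is x / scale, saturated to the fp8 range.
x_fp8 = (x.float() / per_tensor_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fnuz)
```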

charlifu avatar Aug 07 '24 17:08 charlifu

@HaiShaw - are there any limitations for ops.scaled_fp8_quant on hip? Can we cover all cases?

  • static
  • dynamic per tensor
  • dynamic per token

It should cover all cases. Will test; currently only static is tested.

Thanks - if you can verify this it would really simplify the implementation since we can then cover all cases and easily turn this on in every fp8 frontend

robertgshaw2-redhat avatar Aug 07 '24 18:08 robertgshaw2-redhat

@HaiShaw - are there any limitations for ops.scaled_fp8_quant on hip? Can we cover all cases?

  • static
  • dynamic per tensor
  • dynamic per token

@robertgshaw2-neuralmagic, as @charlifu said, I didn't spot any reason why we couldn't cover all cases, though we do need to fix FP8_E4M3_MAX on HIP. By the way, do we have test scripts to validate per-token quantization (low- or high-level verification)? One more thing: scaled_fp8_conversion and scaled_fp8_conversion_vec are not optimal as they stand on HIP; they don't take advantage of AMD's V_CVT_PK_FP8 instruction (see the AMD Instinct MI300 Instruction Set Architecture), so we need an improvement there.

HaiShaw avatar Aug 07 '24 19:08 HaiShaw

@HaiShaw - are there any limitations for ops.scaled_fp8_quant on hip? Can we cover all cases?

  • static
  • dynamic per tensor
  • dynamic per token

@robertgshaw2-neuralmagic, as @charlifu said, I didn't spot any reason why we couldn't cover all cases, though we do need to fix FP8_E4M3_MAX on HIP. By the way, do we have test scripts to validate per-token quantization (low- or high-level verification)? One more thing: scaled_fp8_conversion and scaled_fp8_conversion_vec are not optimal as they stand on HIP; they don't take advantage of AMD's V_CVT_PK_FP8 instruction (see the AMD Instinct MI300 Instruction Set Architecture), so we need an improvement there.

When you say "validate" - do you mean testing?

We have some models that we can hook up for integration testing as well as unit tests in tests/kernels

robertgshaw2-redhat avatar Aug 07 '24 19:08 robertgshaw2-redhat

@HaiShaw - are there any limitations for ops.scaled_fp8_quant on hip? Can we cover all cases?

  • static
  • dynamic per tensor
  • dynamic per token

@robertgshaw2-neuralmagic, as @charlifu said, I didn't spot any reason why we couldn't cover all cases, though we do need to fix FP8_E4M3_MAX on HIP. By the way, do we have test scripts to validate per-token quantization (low- or high-level verification)? One more thing: scaled_fp8_conversion and scaled_fp8_conversion_vec are not optimal as they stand on HIP; they don't take advantage of AMD's V_CVT_PK_FP8 instruction (see the AMD Instinct MI300 Instruction Set Architecture), so we need an improvement there.

When you say "validate" - do you mean testing?

We have some models that we can hook up for integration testing as well as unit tests in tests/kernels

@HaiShaw is there any way I could have access to a dev environment? I wanted to add this to compressed-tensors

robertgshaw2-redhat avatar Aug 07 '24 20:08 robertgshaw2-redhat

@HaiShaw - are there any limitations for ops.scaled_fp8_quant on hip? Can we cover all cases?

  • static
  • dynamic per tensor
  • dynamic per token

@robertgshaw2-neuralmagic, as @charlifu said, I didn't spot any reason why we couldn't cover all cases, though we do need to fix FP8_E4M3_MAX on HIP. By the way, do we have test scripts to validate per-token quantization (low- or high-level verification)? One more thing: scaled_fp8_conversion and scaled_fp8_conversion_vec are not optimal as they stand on HIP; they don't take advantage of AMD's V_CVT_PK_FP8 instruction (see the AMD Instinct MI300 Instruction Set Architecture), so we need an improvement there.

When you say "validate" - do you mean testing? We have some models that we can hook up for integration testing as well as unit tests in tests/kernels

@HaiShaw is there any way I could have access to a dev environment? I wanted to add this to compressed-tensors

@robertgshaw2-neuralmagic - Yes, I meant testing, including end-to-end use cases as well. For the dev environment, do you have access to @WoosukKwon's system?

HaiShaw avatar Aug 07 '24 20:08 HaiShaw

@HaiShaw - are there any limitations for ops.scaled_fp8_quant on hip? Can we cover all cases?

  • static
  • dynamic per tensor
  • dynamic per token

@robertgshaw2-neuralmagic, as @charlifu said, I didn't spot any reason why we couldn't cover all cases, though we do need to fix FP8_E4M3_MAX on HIP. By the way, do we have test scripts to validate per-token quantization (low- or high-level verification)? One more thing: scaled_fp8_conversion and scaled_fp8_conversion_vec are not optimal as they stand on HIP; they don't take advantage of AMD's V_CVT_PK_FP8 instruction (see the AMD Instinct MI300 Instruction Set Architecture), so we need an improvement there.

When you say "validate" - do you mean testing? We have some models that we can hook up for integration testing as well as unit tests in tests/kernels

@HaiShaw is there any way I could have access to a dev environment? I wanted to add this to compressed-tensors

@robertgshaw2-neuralmagic - Yes, I meant testing, including end-to-end use cases as well. For the dev environment, do you have access to @WoosukKwon's system?

I'll ask Woosuk if I can borrow it.

  • For kernel tests -> https://github.com/vllm-project/vllm/blob/main/tests/kernels/quant_utils.py#L56
  • For example models with dynamic per-token quantization, I need to turn on the compressed-tensors implementation

robertgshaw2-redhat avatar Aug 07 '24 20:08 robertgshaw2-redhat

@mgoin @robertgshaw2-neuralmagic @HaiShaw Some updates:

  • I tested this PR with dynamic quantization and found that using 240.0 as the max value causes accuracy issues, while using 224.0 does not.
  • I updated quant_utils.py to make it work on ROCm; all test cases in test_fp8_quant.py now pass.
  • Dynamic per-tensor quantization has been tested on llama-2-7b and llama-2-70b.
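A sketch of what the max-value adjustment looks like in a reference quantization path (the names here are illustrative, not necessarily those used in quant_utils.py):

```python
import torch

FP8_DTYPE = torch.float8_e4m3fnuz
# torch.finfo(FP8_DTYPE).max is 240.0, but the tests above showed better accuracy
# when saturating to 224.0, so that is used as the effective max on ROCm.
ROCM_FP8_MAX = 224.0

def ref_dynamic_per_tensor_quant(x: torch.Tensor):
    """Reference per-tensor quantization that kernel outputs can be checked against."""
    scale = x.abs().max().float().clamp(min=1e-12) / ROCM_FP8_MAX
    x_q = (x.float() / scale).clamp(-ROCM_FP8_MAX, ROCM_FP8_MAX).to(FP8_DTYPE)
    return x_q, scale
```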

charlifu avatar Aug 08 '24 15:08 charlifu

@mgoin @robertgshaw2-neuralmagic @HaiShaw Some updates:

  • I tested this PR with dynamic quantization and found that using 240.0 as the max value causes accuracy issues, while using 224.0 does not.
  • I updated quant_utils.py to make it work on ROCm; all test cases in test_fp8_quant.py now pass.
  • Dynamic per-tensor quantization has been tested on llama-2-7b and llama-2-70b.

Great! Do you guys have the ability to run any integration tests in the CI, or not?

robertgshaw2-redhat avatar Aug 08 '24 15:08 robertgshaw2-redhat

@mgoin @robertgshaw2-neuralmagic @HaiShaw Some updates:

  • I tested this PR with dynamic quantization and found that using 240.0 as the max value causes accuracy issues, while using 224.0 does not.
  • I updated quant_utils.py to make it work on ROCm; all test cases in test_fp8_quant.py now pass.
  • Dynamic per-tensor quantization has been tested on llama-2-7b and llama-2-70b.

Great! Do you guys have the ability to run any integration tests in the CI, or not?

Can you point out which tests we should enable on ROCm for this PR? Currently only test_fp8_quant.py is enabled.

charlifu avatar Aug 08 '24 18:08 charlifu

@charlifu - what setup do you use for development? Do you have any instructions for setting up a dev environment?

robertgshaw2-redhat avatar Aug 08 '24 19:08 robertgshaw2-redhat

@mgoin @robertgshaw2-neuralmagic @HaiShaw Some updates:

  • I tested this PR with dynamic quantization and found that using 240.0 as the max value causes accuracy issues, while using 224.0 does not.
  • I updated quant_utils.py to make it work on ROCm; all test cases in test_fp8_quant.py now pass.
  • Dynamic per-tensor quantization has been tested on llama-2-7b and llama-2-70b.

Great! Do you guys have the ability to run any integration tests in the CI, or not?

Can you point out which tests we should enable on ROCm for this PR? Currently only test_fp8_quant.py is enabled.

We have end-to-end correctness tests running through lm-eval-harness. We could enable a subset of these for ROCm (especially the FP8 ones).
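As an illustration, one of these checks could be driven from Python via lm-eval's simple_evaluate API (the model, task, and threshold below are placeholders; the real CI defines its own baselines):

```python
import lm_eval

# Evaluate an FP8 checkpoint through the vLLM backend and compare to a baseline.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=neuralmagic/Meta-Llama-3-8B-Instruct-FP8",
    tasks=["gsm8k"],
    num_fewshot=5,
)
# Key names follow lm-eval 0.4's result layout.
score = results["results"]["gsm8k"]["exact_match,strict-match"]
assert score > 0.70  # illustrative threshold only
```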

robertgshaw2-redhat avatar Aug 08 '24 19:08 robertgshaw2-redhat

@charlifu - what setup do you use for development? Do you have any instructions for setting up a dev environment?

Please refer to Link

charlifu avatar Aug 08 '24 19:08 charlifu

@robertgshaw2-neuralmagic I am able to run lm-evaluation-harness on my dev machine. Can you give me a list of models and tasks you want accuracy numbers for, so I can get the results for you?

charlifu avatar Aug 13 '24 14:08 charlifu

Hi @charlifu thanks for offering! Here are a few models where I've linked to the evaluation section so you can see the lm-eval command used and the expected scores:

These should all be in the AutoFP8 static per-tensor scale checkpoint format. Hopefully we can get you integrated into the compressed-tensors backend in followup work so we can begin testing more advanced dynamic per-token activations and per-channel weights.

mgoin avatar Aug 13 '24 16:08 mgoin

@charlifu, did you get a chance to enable V_CVT_PK_FP8 in scaled_fp8_conversion_vec on MI30x? It seems necessary for performance. @mgoin, for model evaluation, we could leave the MoE model (Mixtral) to a later follow-up, since it goes through a different compute path.

HaiShaw avatar Aug 13 '24 17:08 HaiShaw

@mgoin @robertgshaw2-neuralmagic additionally, we plan to support https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 soon (we already support https://huggingface.co/amd/Meta-Llama-3.1-405B-Instruct-fp8-quark-vllm, though not via this PR), once FBGEMM-FP8 (dynamic per-token activations and per-channel weights) support is ready, FYI.

HaiShaw avatar Aug 13 '24 17:08 HaiShaw

This PR adds fp8 linear layer support on ROCm.

  • Use torch.float8_e4m3fnuz as the fp8 data type on ROCm, and update the fp8 conversion kernels accordingly.
  • Convert the weights from torch.float8_e4m3fn to torch.float8_e4m3fnuz after weight loading, and adjust the scaling factor accordingly.
  • Since ROCm uses torch 2.5, in which _scaled_mm returns a single value, add a condition check when returning the result from _scaled_mm.

Evaluation results:

Meta-Llama-3-8B-Instruct-FP8

| Tasks | Version | Filter | n-shot | Metric | Value | ± Stderr |
|---|---|---|---|---|---|---|
| Open LLM Leaderboard | N/A | | | | | |
| arc_challenge | 1 | none | 25 | acc ↑ | 0.5776 | 0.0144 |
| arc_challenge | 1 | none | 25 | acc_norm ↑ | 0.6203 | 0.0142 |
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | 0.7597 | 0.0118 |
| gsm8k | 3 | strict-match | 5 | exact_match ↑ | 0.7627 | 0.0117 |
| hellaswag | 1 | none | 10 | acc ↑ | 0.5879 | 0.0049 |
| hellaswag | 1 | none | 10 | acc_norm ↑ | 0.7843 | 0.0041 |
| mmlu | 2 | none | | acc ↑ | 0.6649 | 0.0038 |
| truthfulqa_mc2 | 2 | none | 0 | acc ↑ | 0.5257 | 0.0153 |
| winogrande | 1 | none | 5 | acc ↑ | 0.7601 | 0.0120 |

Qwen2-7B-Instruct-FP8

| Tasks | Version | Filter | n-shot | Metric | Value | ± Stderr |
|---|---|---|---|---|---|---|
| Open LLM Leaderboard | N/A | | | | | |
| arc_challenge | 1 | none | 25 | acc ↑ | 0.5836 | 0.0144 |
| arc_challenge | 1 | none | 25 | acc_norm ↑ | 0.6212 | 0.0142 |
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | 0.7779 | 0.0114 |
| gsm8k | 3 | strict-match | 5 | exact_match ↑ | 0.7066 | 0.0125 |
| hellaswag | 1 | none | 10 | acc ↑ | 0.6097 | 0.0049 |
| hellaswag | 1 | none | 10 | acc_norm ↑ | 0.8108 | 0.0039 |
| mmlu | 2 | none | | acc ↑ | 0.7008 | 0.0037 |
| truthfulqa_mc2 | 2 | none | 0 | acc ↑ | 0.5695 | 0.0154 |
| winogrande | 1 | none | 5 | acc ↑ | 0.7419 | 0.0123 |

Mixtral-8x7B-Instruct-v0.1-FP8

| Tasks | Version | Filter | n-shot | Metric | Value | ± Stderr |
|---|---|---|---|---|---|---|
| Open LLM Leaderboard | N/A | | | | | |
| arc_challenge | 1 | none | 25 | acc ↑ | 0.6664 | 0.0138 |
| arc_challenge | 1 | none | 25 | acc_norm ↑ | 0.6928 | 0.0135 |
| gsm8k | 3 | flexible-extract | 5 | exact_match ↑ | 0.6270 | 0.0133 |
| gsm8k | 3 | strict-match | 5 | exact_match ↑ | 0.6224 | 0.0134 |
| hellaswag | 1 | none | 10 | acc ↑ | 0.6800 | 0.0047 |
| hellaswag | 1 | none | 10 | acc_norm ↑ | 0.8718 | 0.0033 |
| mmlu | 2 | none | | acc ↑ | 0.6981 | 0.0036 |
| truthfulqa_mc2 | 2 | none | 0 | acc ↑ | 0.6434 | 0.0150 |
| winogrande | 1 | none | 5 | acc ↑ | 0.8256 | 0.0107 |

Can we get numbers on FP16 models for comparison?

HaiShaw avatar Aug 15 '24 06:08 HaiShaw

@HaiShaw those numbers match up with the base fp16 and fp8 (on H100) evals I shared earlier for those models, so I think this is sufficient!

For reference: neuralmagic/Meta-Llama-3-8B-Instruct-FP8, neuralmagic/Qwen2-7B-Instruct-FP8, neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8

mgoin avatar Aug 15 '24 14:08 mgoin

@HaiShaw those numbers match up with the base fp16 and fp8 (on H100) evals I shared earlier for those models, so I think this is sufficient!

For reference: neuralmagic/Meta-Llama-3-8B-Instruct-FP8, neuralmagic/Qwen2-7B-Instruct-FP8, neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8

@mgoin , thanks for your confirmation! LGTM! @charlifu

HaiShaw avatar Aug 15 '24 15:08 HaiShaw

Looks good to me! It would be nice if you could add an FP8 model loading test to the AMD CI, so we are testing beyond just the kernel support

Thank you. We will add the test in a subsequent PR.
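Such a loading test could be as small as the following sketch (the model, prompt, and test name are placeholders; the real test would live alongside the existing AMD CI model tests):

```python
import pytest
import torch
from vllm import LLM, SamplingParams


@pytest.mark.skipif(not torch.version.hip, reason="ROCm-only fp8 loading test")
def test_fp8_model_loads_and_generates():
    # Loading exercises the e4m3fn -> e4m3fnuz weight conversion; a short greedy
    # generation exercises the fp8 linear layers themselves.
    llm = LLM(model="neuralmagic/Meta-Llama-3-8B-Instruct-FP8", quantization="fp8")
    out = llm.generate(["The president of the United States is"],
                       SamplingParams(temperature=0.0, max_tokens=8))
    assert len(out[0].outputs[0].text) > 0
```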

charlifu avatar Aug 15 '24 21:08 charlifu

@mgoin @robertgshaw2-neuralmagic additionally, we plan to support https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 soon (we already support https://huggingface.co/amd/Meta-Llama-3.1-405B-Instruct-fp8-quark-vllm, though not via this PR), once FBGEMM-FP8 (dynamic per-token activations and per-channel weights) support is ready, FYI.

Hi @HaiShaw, do you know how ROCm 6.2 supports fp8 kernels in vLLM? Are there any examples or work in progress on writing fp8 ops (GEMM) with the ROCm 6.2 SDK?

@yiakwy-xpu-ml-framework-team ROCm 6.2 supports fp8 natively via hipBLASLt, Triton, and CK (not yet brought into use in vLLM). The current MI300 FP8 format is somewhat different from the OCP format, so we introduced a max ceiling and a scaling-factor adjustment to make sure it accepts OCP-standard FP8 data from external interfaces (checkpoints, etc.) and computes on native AMD hardware. In terms of work or tasks, we welcome all kinds of discussion and collaboration 😄

HaiShaw avatar Sep 11 '24 00:09 HaiShaw

@yiakwy-xpu-ml-framework-team ROCm 6.2 supports fp8 natively via hipBLASLt, Triton, and CK (not yet brought into use in vLLM). The current MI300 FP8 format is somewhat different from the OCP format, so we introduced a max ceiling and a scaling-factor adjustment to make sure it accepts OCP-standard FP8 data from external interfaces (checkpoints, etc.) and computes on native AMD hardware. In terms of work or tasks, we welcome all kinds of discussion and collaboration 😄

Glad to see this. I was just curious about fp8_max in fp8_e4m3fnuz:

  • 240 corresponds to 0b0 1111 111: (-1)^S × 2^(e − bias) × 1.f₂ = 2^(15 − 8) × (1 + 0.875) = 240
  • 224 corresponds to 0b0 1111 110: 2^(15 − 8) × (1 + 0.75) = 224

And I am looking for more information about fp8 features.
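The bit-level arithmetic above can be checked directly in PyTorch (a small sanity snippet, assuming a torch build with float8 support):

```python
import torch

# Decode the two candidate "max" bit patterns in float8_e4m3fnuz (exponent bias 8).
bits = torch.tensor([0b0_1111_111, 0b0_1111_110], dtype=torch.uint8)
print(bits.view(torch.float8_e4m3fnuz).float())  # tensor([240., 224.])

# The dtype's nominal max is 240.0; 224.0 is the next representable value below it.
print(torch.finfo(torch.float8_e4m3fnuz).max)    # 240.0
```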