feat: FP8 Rowwise quantization support for Cohere models
This adds FP8 support for the LayerNorm kernel in the same way as was done for the RmsNorm kernel, which then allows us to use FP8 Rowwise quantization with the Cohere models.
For previous discussion, see https://github.com/NVIDIA/TensorRT-LLM/issues/2912
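For context, here is a minimal PyTorch sketch (not the actual CUDA kernel in this PR) of what the fused LayerNorm + FP8 rowwise quantization computes, assuming the same per-token scaling scheme the RmsNorm path uses; all names below are illustrative:

```python
# Minimal PyTorch sketch of fused LayerNorm + FP8 rowwise quantization.
# Illustrative only; the PR implements this as a CUDA kernel.
# Requires PyTorch >= 2.1 for torch.float8_e4m3fn.
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def layernorm_fp8_rowwise(x, weight, bias, eps=1e-5):
    # Plain LayerNorm in the input precision.
    y = torch.nn.functional.layer_norm(x, (x.shape[-1],), weight, bias, eps)
    # Rowwise (per-token) scale so each row fills the FP8 dynamic range.
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_E4M3_MAX
    q = (y / scale).to(torch.float8_e4m3fn)
    return q, scale  # downstream FP8 GEMMs dequantize with `scale`

x = torch.randn(4, 128)
q, scale = layernorm_fp8_rowwise(x, torch.ones(128), torch.zeros(128))
```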
/bot run
@QiJune @ming-wei please help review this MR.
/bot run
PR_Github #672 [ run ] triggered by Bot
PR_Github #672 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #565 completed with status: 'FAILURE'
It looks like the CI failed, but the links go to some internal domains, so I can't see what the error is. I have some ideas about what it might be... I probably need to update the other usages of the LayerNorm quantization plugin to handle the new parameters.
@aikitoria your code failed to pass the pre-commit check.
Currently, pre-commit check failures are not copied back to the public repo where they can be viewed; we are working to improve this with this MR:
For now I have manually copied the error message here to provide quick feedback:
You can also refer here for how to run the pre-commit check locally in your own dev environment.
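In case it helps, the standard pre-commit workflow looks like this (assuming the repo ships a .pre-commit-config.yaml; the exact hooks TensorRT-LLM configures may differ):

```bash
pip install pre-commit
pre-commit install          # run the hooks automatically on every commit
pre-commit run --all-files  # or lint the whole tree once
```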
June
Oh I see, I will fix the formatting for both PRs
Thank you for the contribution!
I've left a few comments, but the PR looks overall good.
@juney-nvidia It'd be good if we could find someone familiar with the quantization support. I personally don't have hands-on quantization experience, so I might miss something.
Sure, I just added @Tracin into the code reviewer loop.
Thanks June
Hi @aikitoria, would you mind adding a functional unit test like tests/unittest/trt/quantization/test_smooth_quant_layer_norm.py? It would also be good to add an example usage in examples/commandr/README.md. Thanks.
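For what it's worth, here is a rough sketch of the numerical check such a test could perform. The real test would build a TensorRT network that invokes the LayerNorm quantization plugin (as the SmoothQuant test does), so the reference function below is only an illustrative stand-in, not the plugin API:

```python
# Hedged sketch of a functional test for LayerNorm + FP8 rowwise quantization.
# The real unit test would run the TensorRT-LLM plugin and compare against a
# reference; here a pure-PyTorch stand-in plays both roles for illustration.
# Requires PyTorch >= 2.1 for torch.float8_e4m3fn.
import unittest
import torch

FP8_E4M3_MAX = 448.0

def _fp8_rowwise(y):
    # Per-token scale, then cast to FP8 E4M3.
    scale = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    return (y / scale).to(torch.float8_e4m3fn), scale

class TestLayerNormFp8Rowwise(unittest.TestCase):
    def test_dequantized_output_matches_fp32_layernorm(self):
        torch.manual_seed(0)
        x = torch.randn(8, 256)
        ref = torch.nn.functional.layer_norm(x, (256,))
        q, scale = _fp8_rowwise(ref)  # stand-in for the fused kernel output
        dequant = q.to(torch.float32) * scale
        # E4M3 keeps only ~3 mantissa bits, so the tolerance is coarse.
        torch.testing.assert_close(dequant, ref, atol=0.1, rtol=0.1)

if __name__ == "__main__":
    unittest.main()
```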
@aikitoria any update on this?
Sorry, I have been busy at work, but I will get back to this week!
Edit: I still haven't been able to get to it.
Closing as there have been no updates from the requester for 10+ days. Feel free to reopen when you have some bandwidth to work on it!