Do FP8 rowwise bias addition in higher precision
Summary: Previously, when a bias was used in our FP8 rowwise kernel, it was added to the accumulator in its native precision. For example, if the bias is bf16, we would do a bf16 + bf16 addition. It is slightly more efficient and more accurate to leave the accumulator in fp32, cast the bias to fp32, and do the addition in fp32.
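A minimal sketch of the idea (not the actual FBGEMM kernel epilogue; function names here are hypothetical and purely illustrative), assuming a bf16 bias and an fp32 accumulator:

```cpp
#include <cuda_bf16.h>

// Old behavior: downcast the fp32 accumulator to bf16 first, then do a
// bf16 + bf16 add. Precision is lost both in the early cast and in the
// low-precision addition.
__device__ __nv_bfloat16 add_bias_bf16(float acc, __nv_bfloat16 bias) {
  __nv_bfloat16 acc_bf16 = __float2bfloat16(acc);
  return __hadd(acc_bf16, bias);
}

// New behavior: upcast the bias to fp32, add while the accumulator is
// still in fp32, and downcast only the final result once.
__device__ __nv_bfloat16 add_bias_fp32(float acc, __nv_bfloat16 bias) {
  float result = acc + __bfloat162float(bias);
  return __float2bfloat16(result);
}
```

Keeping the accumulator in fp32 avoids one extra rounding step before the add and lets the addition itself happen at full single precision.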
Differential Revision: D74408348
Deploy Preview for pytorch-fbgemm-docs ready!
| Name | Link |
|---|---|
| Latest commit | c6f491c6aeefa6fee36a951d6f16903ab4595212 |
| Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/684b672f453c1c00084b2559 |
| Deploy Preview | https://deploy-preview-4095--pytorch-fbgemm-docs.netlify.app |
This pull request was exported from Phabricator. Differential Revision: D74408348