Paddle [XPU]rewrite xpu amp implementation based on that of GPU

Open runzhech opened this issue 1 year ago • 3 comments

PR types

Others

PR changes

OPs

Description

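Comparing the CUDA and XPU amp_kernel implementations revealed the following differences:

XPU runs check_nan on the fp16 values first, then converts them to fp32 and scales them. GPU converts fp16 to fp32 first, then runs check_nan.
When XPU finds a nan, it immediately calls scale to set the values to zero, and UpdateLossScalingKernel then calls xpu::constant to zero them a second time (this appears to be a redundant operation). When GPU finds a nan, the values are zeroed only in the subsequent UpdateLossScalingKernel.

This PR aligns the XPU code with the GPU implementation: it calls the fused operator check_finite_unscale and drops the redundant zeroing, which should slightly improve both accuracy and performance.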
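
The flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the kernel code from this PR: `unscale_and_check` and `update_step` are hypothetical names standing in for the fused check/unscale pass and for the zeroing done in UpdateLossScalingKernel, and ordinary Python floats stand in for fp16/fp32 tensors.

```python
import math


def unscale_and_check(grads, loss_scale):
    """Hypothetical fused pass: unscale every gradient, then test the result."""
    inv_scale = 1.0 / loss_scale
    unscaled = [g * inv_scale for g in grads]
    # The finiteness check runs on the values that will actually be used.
    found_inf = any(not math.isfinite(v) for v in unscaled)
    return unscaled, found_inf


def update_step(grads, found_inf):
    """Hypothetical update step: the only place where gradients are zeroed."""
    return [0.0] * len(grads) if found_inf else list(grads)


grads = [2048.0, float("nan"), -512.0]
unscaled, found_inf = unscale_and_check(grads, loss_scale=1024.0)
final = update_step(unscaled, found_inf)
```

In this sketch the gradients are zeroed exactly once, in `update_step`, rather than once at detection time and again during the loss-scaling update.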

Feb 07 '24 10:02 runzhech

Your PR has been submitted. Thanks for your contribution to this open-source project! Please watch for the results of the automated CI tests. See the Paddle CI Manual for details.

Feb 07 '24 10:02 paddle-bot[bot]

Why does this change affect accuracy?

Feb 11 '24 07:02 QingshuChen

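Why does this change affect accuracy?

@QingshuChen The original logic ran the inf check on grad first and scaled afterwards. Inf values can appear after the scale step, and those values would then be used in the weight update, producing an abnormal loss. This change scales first and checks for inf afterwards, which avoids that situation.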
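
A toy comparison of the two orderings, as a sketch only. The function names and the `factor` value are hypothetical, and the magnitudes are exaggerated to double-precision range so that plain Python floats overflow; the real kernels operate on fp16/fp32 tensors.

```python
import math


def check_then_scale(grads, factor):
    """Old ordering: test the inputs, then scale them."""
    found_inf = any(not math.isfinite(g) for g in grads)
    scaled = [g * factor for g in grads]
    return scaled, found_inf


def scale_then_check(grads, factor):
    """New ordering: scale first, then test the scaled values."""
    scaled = [g * factor for g in grads]
    found_inf = any(not math.isfinite(v) for v in scaled)
    return scaled, found_inf


grads = [1.0, 1e308]   # finite inputs, one close to the float64 limit
factor = 100.0         # hypothetical multiplier applied by the scale step

old_out, old_flag = check_then_scale(grads, factor)
new_out, new_flag = scale_then_check(grads, factor)

print(old_flag, old_out)   # the overflow produced by scaling is not flagged
print(new_flag, new_out)   # the same overflow is flagged
```

With the old ordering the flag stays clear even though the scaled output contains inf, so a later weight update would consume it. With the new ordering the flag is set and the update step can skip or zero the gradients.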

Feb 12 '24 10:02 runzhech
