Benchmark, improve and document mixed precision (AMP) support in Models
- [ ] Train different models with mixed precision and verify that it does not break the API, that it yields better performance at similar accuracy, and that it does not cause numeric instabilities / NaN loss.
- [ ] Benchmark the speed-up of AMP vs. TF32 (on an A100 or newer GPU); see the sketch after this list.
- [ ] Document for our users how and when to use AMP with Merlin Models
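
For the benchmarking item, a minimal timing sketch along these lines could be used (assumptions: `model` is any compiled Keras / Merlin Models model, `dataset` is a `tf.data.Dataset`, and the step counts are illustrative):

```python
import time

import tensorflow as tf
from tensorflow.keras import mixed_precision

# TF32 is enabled by default on Ampere GPUs (A100); toggle it explicitly so
# the baseline being compared against AMP is unambiguous.
tf.config.experimental.enable_tensor_float_32_execution(True)

# For the AMP run, set the dtype policy *before* building the model;
# use "float32" instead to measure the TF32-only baseline.
mixed_precision.set_global_policy("mixed_float16")


def time_per_step(model, dataset, steps=100):
    # One warm-up pass so graph tracing and autotuning are excluded from the timing.
    model.fit(dataset, steps_per_epoch=1, epochs=1, verbose=0)
    start = time.perf_counter()
    model.fit(dataset, steps_per_epoch=steps, epochs=1, verbose=0)
    return (time.perf_counter() - start) / steps
```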
Notes from @vysarge's preliminary experiments, reported in this spreadsheet (NVIDIA internal only):
With AMP on (via Keras mixed precision), the MM iteration is actually slower by 35.6 ms. The majority of this time comes from one block of ops, including calls to SetToValue, IsFinite, DeviceReduceKernel, and UnsortedSegmentCustomKernel, which contribute a combined 37.9 ms of GPU time. The etiology is unclear; these kernels appear to have one call per categorical feature and to take longer for features with a high cardinality. Without these calls I would expect AMP to be saving ~7 ms.
She later wrote:
AMP issues appear to be related to loss scaling. The ~35ms slowdown from the previous email is present when not scaling losses or when using keras.mixed_precision.LossScaleOptimizer with the default dynamic=True. Using keras.mixed_precision.LossScaleOptimizer with dynamic=False instead, AMP does indeed save ~7ms off the training iteration time. (See also nvbugs/3980579 tracking a similar issue.)
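
For reference, a static loss scale can be configured roughly as follows (a minimal sketch; the `initial_scale` value is illustrative and would need to be validated against the model's loss/gradient magnitudes):

```python
import tensorflow as tf

base_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

# dynamic=False selects static loss scaling, which avoids the per-step
# finiteness checks of dynamic scaling; an explicit initial_scale is then required.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    base_optimizer, dynamic=False, initial_scale=2**12
)

# Pass the already-wrapped optimizer to model.compile() so Keras does not
# wrap it again with the default dynamic LossScaleOptimizer.
```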
Comment by @vysarge:
A fix for part of the AMP slowdown described in nvbugs/3980579 has recently been accepted into Keras (PR link).
Is this issue still open?
Is there any example of how to use mixed_precision for training, or should we simply follow the standard TensorFlow/Keras approach?
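
For reference, the standard Keras recipe would look roughly like the sketch below (assumptions: `build_model()` and `train_dataset` are placeholders, e.g. a Merlin Models model built from a schema and a `tf.data.Dataset`; whether anything model-specific is needed on top of this is what this issue tracks):

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Standard Keras AMP recipe: set the policy before building the model.
mixed_precision.set_global_policy("mixed_float16")

model = build_model()  # placeholder: e.g. a Merlin Models model

# Under a mixed_float16 policy, compile() wraps the optimizer with dynamic
# loss scaling automatically.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="binary_crossentropy")

# TerminateOnNaN makes any numeric-instability failure visible immediately.
model.fit(
    train_dataset,  # placeholder tf.data.Dataset
    epochs=1,
    callbacks=[tf.keras.callbacks.TerminateOnNaN()],
)
```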