Benchmark, improve and document mixed precision (AMP) support in Models
- [ ] Train different models with mixed precision and verify that it does not break the API, that it yields better performance at similar accuracy, and that it does not cause numeric instabilities / NaN loss.
- [ ] Benchmark the speed-up of AMP vs. TF32 (on an A100 or newer GPU); see the sketch after this list.
- [ ] Document for our users how and when to use AMP with Merlin Models
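
For the benchmarking item, a minimal timing sketch along these lines could be used (assumptions: `model` is any compiled Keras / Merlin Models model, `dataset` is a `tf.data.Dataset`, and the step counts are illustrative):

```python
import time

import tensorflow as tf
from tensorflow.keras import mixed_precision

# TF32 is enabled by default on Ampere GPUs (A100); toggle it explicitly so
# the baseline being compared against AMP is unambiguous.
tf.config.experimental.enable_tensor_float_32_execution(True)

# For the AMP run, set the dtype policy *before* building the model;
# use "float32" instead to measure the TF32-only baseline.
mixed_precision.set_global_policy("mixed_float16")


def time_per_step(model, dataset, steps=100):
    # One warm-up pass so graph tracing and autotuning are excluded from the timing.
    model.fit(dataset, steps_per_epoch=1, epochs=1, verbose=0)
    start = time.perf_counter()
    model.fit(dataset, steps_per_epoch=steps, epochs=1, verbose=0)
    return (time.perf_counter() - start) / steps
```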
Notes from @vysarge's preliminary experiments, reported in this spreadsheet (NVIDIA internal only):
With AMP on (via Keras mixed precision), the MM iteration is actually slower by 35.6 ms. The majority of this time comes from one block of ops, including calls to SetToValue, IsFinite, DeviceReduceKernel, and UnsortedSegmentCustomKernel, which contribute a combined 37.9 ms of GPU time. The etiology is unclear; these kernels appear to have one call per categorical feature and to take longer for features with a high cardinality. Without these calls I would expect AMP to be saving ~7 ms.
She later wrote:
AMP issues appear to be related to loss scaling. The ~35ms slowdown from the previous email is present when not scaling losses or when using keras.mixed_precision.LossScaleOptimizer with the default dynamic=True. Using keras.mixed_precision.LossScaleOptimizer with dynamic=False instead, AMP does indeed save ~7ms off the training iteration time. (See also nvbugs/3980579 tracking a similar issue.)
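
For reference, a static loss scale can be configured roughly as follows (a minimal sketch; the `initial_scale` value is illustrative and would need to be validated against the model's loss/gradient magnitudes):

```python
import tensorflow as tf

base_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

# dynamic=False selects static loss scaling, which avoids the per-step
# finiteness checks of dynamic scaling; an explicit initial_scale is then required.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    base_optimizer, dynamic=False, initial_scale=2**12
)

# Pass the already-wrapped optimizer to model.compile() so Keras does not
# wrap it again with the default dynamic LossScaleOptimizer.
```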
Comment by @vysarge:
A fix for part of the AMP slowdown described in nvbugs/3980579 has recently been accepted into Keras (PR link).
Is this issue still open?
Is there any example of how to use mixed_precision for training, or should we simply follow the standard TensorFlow/Keras approach?
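
For reference, the standard Keras recipe would look roughly like the sketch below (assumptions: `build_model()` and `train_dataset` are placeholders, e.g. a Merlin Models model built from a schema and a `tf.data.Dataset`; whether anything model-specific is needed on top of this is what this issue tracks):

```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Standard Keras AMP recipe: set the policy before building the model.
mixed_precision.set_global_policy("mixed_float16")

model = build_model()  # placeholder: e.g. a Merlin Models model

# Under a mixed_float16 policy, compile() wraps the optimizer with dynamic
# loss scaling automatically.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="binary_crossentropy")

# TerminateOnNaN makes any numeric-instability failure visible immediately.
model.fit(
    train_dataset,  # placeholder tf.data.Dataset
    epochs=1,
    callbacks=[tf.keras.callbacks.TerminateOnNaN()],
)
```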