Investigate and document the scenarios in which LazyAdam should be used instead of the Adam optimizer (both the legacy version and the new version that becomes the default in TF 2.11)
- [ ] Benchmark LazyAdam applied to the whole model and to the embeddings only (example here), and compare it against both the legacy Adam optimizer (TF <= 2.10) and the new Adam optimizer (`keras.optimizers.experimental.Adam` in TF versions < 2.11, which becomes the default `Adam` in TF 2.11)
- [ ] Document when users should use LazyAdam, in which parts of the model, and how it affects runtime and accuracy
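To motivate the benchmark, here is a minimal sketch (assuming plain TF 2.x, with an illustrative toy embedding table) of the behavioral difference at stake: once a row has been touched, Adam's momentum keeps producing updates for it on every later step even when its gradient is zero, whereas LazyAdam only updates rows that receive a gradient in the current step.

```python
import tensorflow as tf

# Toy embedding table with 4 rows (illustrative, not from the task).
emb = tf.Variable(tf.ones((4, 2)))
opt = tf.keras.optimizers.Adam(learning_rate=0.1)

def sparse_grad(rows):
    # IndexedSlices gradient that touches only `rows`, as an embedding
    # lookup would produce during training.
    return tf.IndexedSlices(
        values=tf.ones((len(rows), 2)),
        indices=tf.constant(rows, dtype=tf.int64),
        dense_shape=tf.constant([4, 2], dtype=tf.int64),
    )

opt.apply_gradients([(sparse_grad([0, 1]), emb)])  # step 1: rows 0 and 1
row1_after_step1 = emb[1].numpy().copy()

opt.apply_gradients([(sparse_grad([0]), emb)])     # step 2: row 0 only
row1_after_step2 = emb[1].numpy()

# Under Adam, row 1 still moves in step 2 because its momentum slot is
# nonzero and the variable update is applied densely; LazyAdam would
# leave row 1 untouched in step 2.
row1_moved_without_gradient = bool((row1_after_step1 != row1_after_step2).any())

# Rows 2 and 3 were never touched, so their slots are still zero and
# they stay at their initial values under either optimizer.
untouched_rows_unchanged = bool((emb[2:].numpy() == 1.0).all())
```

This per-step dense work over every previously-touched row is one plausible source of the per-iteration gap measured below.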
Starting point
This example demonstrates how to use LazyAdam with Merlin Models (MM); it was inspired by this post and uses LazyAdam for the sparse embeddings and regular Adam for the rest of the network's dense parameters. The remaining questions above still need to be addressed.
This task originated from profiling done by @vysarge using the DLRM model and synthetic data, reported in this spreadsheet (NVIDIA internal only). Here are her comments on it:
> Tried a few different things with the optimizer: LazyAdam is significantly faster, at 38 ms per iteration (vs. 68 ms with the original Adam). This brings end-to-end time much closer to the 35 ms measured with the SGD optimizer. The TensorFlow Addons and Merlin Models implementations have similar performance end-to-end.
>
> The experimental/new version of the Adam optimizer is faster but still falls behind LazyAdam (49 ms per iteration). In TF versions < 2.11 this is `keras.optimizers.experimental.Adam`; as of 2.11 this will become the default `keras.optimizers.Adam`.
>
> Worth noting that all optimizer tests done so far use the same optimizer for all parts of the model. A more realistic test would be to use Adam for the MLP portion and LazyAdam for the embedding portion, perhaps with Merlin's MultiOptimizer, but I ran out of time to try this today. Overall, the Adam optimizer version is the largest factor contributing to the original times.