Investigate allowing async training for DLRM
This is a longer-term issue to explore. Large models, such as a large recommender like DLRM, can benefit from making parts of the training pipeline asynchronous. These optimizations are commonly used to make large models performant in production; "the math changes on paper but not in practice". Some practitioners consider this an important optimization for certain models. It isn't clear how this fits into the MLPerf rules, specifically for the closed division. It could make sense to allow it in closed because it is a relatively "vanilla" optimization, but the current rules aren't structured to permit it.
AI(Tayo) -- sync with model owners on this idea.
This is not allowed by the current rules - there was a ruling on stale gradients, which is what this would be: https://github.com/mlperf/training_policies/issues/36
Indeed. This is specifically a proposal to allow asynchrony for the embedding lookups of DLRM (not allowed by the current rules); all other updates would still be required to be synchronous.
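For concreteness, here is a minimal toy sketch of what "asynchronous embedding lookups, synchronous everything else" could mean, as I understand the proposal. This is plain NumPy, not the DLRM reference implementation; the names, shapes, and the one-step prefetch scheme are illustrative assumptions. The lookup for step t+1 is issued before the embedding update for step t has been applied, so the fetched rows may be one optimizer step stale, while the dense parameters are still updated synchronously every step.

```python
# Toy sketch of one-step-stale embedding lookups (illustrative only; this is
# not the DLRM reference code or an MLPerf submission).
import numpy as np

rng = np.random.default_rng(0)
num_rows, emb_dim, dense_dim, batch = 1000, 16, 16, 32
emb_table = 0.01 * rng.normal(size=(num_rows, emb_dim))     # "sparse" parameters
dense_w = 0.01 * rng.normal(size=(emb_dim + dense_dim, 1))  # "dense" parameters
lr = 0.1

def make_batch():
    idx = rng.integers(0, num_rows, size=batch)             # categorical feature
    x_dense = rng.normal(size=(batch, dense_dim))           # dense features
    y = rng.integers(0, 2, size=(batch, 1)).astype(float)   # click labels
    return idx, x_dense, y

# Prefetch the first batch's embedding rows up front.
next_idx, next_dense, next_y = make_batch()
next_emb = emb_table[next_idx]

for step in range(100):
    idx, x_dense, y, emb = next_idx, next_dense, next_y, next_emb

    # "Async" part: issue the lookup for the *next* step now, before this
    # step's embedding update has been applied. In a real system this lookup
    # would overlap with the dense compute below, so the fetched rows may be
    # one optimizer step stale.
    next_idx, next_dense, next_y = make_batch()
    next_emb = emb_table[next_idx]

    # Dense forward/backward: logistic regression over [embedding, dense].
    feats = np.concatenate([emb, x_dense], axis=1)
    p = 1.0 / (1.0 + np.exp(-(feats @ dense_w)))
    grad_logits = (p - y) / batch

    # Embedding gradient uses the pre-update dense weights.
    grad_emb = grad_logits @ dense_w[:emb_dim].T

    # Everything except the embeddings stays fully synchronous, as the
    # proposal requires: the dense update is applied every step, in order.
    dense_w -= lr * (feats.T @ grad_logits)

    # The embedding update lands after the next lookup was already issued;
    # that bounded staleness is exactly what the proposal would allow.
    np.add.at(emb_table, idx, -lr * grad_emb)
```

In the actual distributed DLRM setting the staleness would come from overlapping the embedding lookup (and its communication) with dense compute, but the effect on the optimizer is the same: the embedding rows used for a step may predate the most recent embedding update.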
This was discussed in the Special Topics meeting on January 30.
There were no objections to allowing this for DLRM. The proposer needs to make the case for this and have it agreed upon before the HParam deadline. The case should take the form of evidence that such overlapping is used in practice, e.g. papers or testimonials.
Postponed until next round.