
Investigate allowing async training for DLRM

Open · bitfort opened this issue 5 years ago · 5 comments

This is a longer-term issue to explore. Large models, such as a large recommender like DLRM, can benefit from making parts of the training pipeline asynchronous. These optimizations are commonly used to make large models performant in production; "the math changes on paper but not in practice". Some practitioners consider this an important optimization for certain models. It isn't clear how this fits into the MLPerf rules, specifically the closed division. Allowing it in closed could make sense because it is a relatively "vanilla" optimization, but the current rules are not structured to permit it.

bitfort avatar Jan 16 '20 16:01 bitfort

Action item (Tayo): sync with model owners on this idea.

bitfort avatar Jan 16 '20 16:01 bitfort

This is not allowed by the current rules - there was a ruling on stale gradients, which is what this would be: https://github.com/mlperf/training_policies/issues/36

nvpaulius avatar Jan 16 '20 22:01 nvpaulius
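To make the "stale gradients" concern concrete, here is a minimal, self-contained sketch (plain NumPy; not MLPerf or DLRM reference code, and all names are illustrative) of an asynchronous update scheme: each gradient is computed against one snapshot of the weights but applied several steps later, after the weights have already moved, so the applied gradient is stale.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
w = np.zeros(4)                                    # shared model parameters
X = rng.normal(size=(256, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0])

def grad(w_snapshot, idx):
    # Gradient of mean squared error on a minibatch, w.r.t. an old snapshot.
    xb, yb = X[idx], y[idx]
    return 2.0 * xb.T @ (xb @ w_snapshot - yb) / len(idx)

lr, delay = 0.05, 4
in_flight = deque()                                # gradients still "in flight"
for step in range(200):
    idx = rng.choice(len(X), size=16, replace=False)
    in_flight.append(grad(w.copy(), idx))          # computed against current w ...
    if len(in_flight) > delay:
        w -= lr * in_flight.popleft()              # ... but applied `delay` steps
                                                   # later: a stale gradient
print("final loss:", float(np.mean((X @ w - y) ** 2)))
```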

Indeed. This is specifically a proposal to allow asynchrony for the embedding lookups of DLRM (not allowed by the current rules); all other updates would still be required to be synchronous.

robieta avatar Jan 16 '20 23:01 robieta
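For reference, a hypothetical sketch of the shape of this proposal (the threading scheme, table sizes, and helper names are assumptions made for illustration, not the DLRM reference implementation): embedding lookups and embedding-table updates are handled by a background worker and may lag the optimizer, while the dense-parameter update stays synchronous with each step.

```python
import queue
import threading

import numpy as np

rng = np.random.default_rng(0)
emb_table = rng.normal(size=(1000, 16))    # "sparse" embedding parameters
dense_w = rng.normal(size=(16, 1))         # "dense" parameters
emb_q = queue.Queue()                      # requests to the embedding worker

def embedding_worker():
    # Serves lookups and applies embedding updates off the main loop;
    # this is where updates can become stale relative to the dense weights.
    while True:
        kind, payload = emb_q.get()
        if kind == "stop":
            break
        if kind == "lookup":
            ids, reply = payload
            reply.put(emb_table[ids].copy())
        elif kind == "update":
            ids, grads, lr = payload
            np.subtract.at(emb_table, ids, lr * grads)

worker = threading.Thread(target=embedding_worker, daemon=True)
worker.start()

lr = 0.01
for step in range(10):
    ids = rng.integers(0, len(emb_table), size=32)
    reply = queue.Queue()
    emb_q.put(("lookup", (ids, reply)))            # asynchronous lookup request
    vecs = reply.get()                             # (toy example waits right away)

    # Dense forward/backward -- this part stays synchronous, per the proposal.
    pred = vecs @ dense_w
    err = pred - 1.0                               # dummy target of all ones
    dense_grad = vecs.T @ err / len(ids)
    emb_grad = err @ dense_w.T
    dense_w -= lr * dense_grad                     # synchronous dense update
    emb_q.put(("update", (ids, emb_grad, lr)))     # asynchronous embedding update

emb_q.put(("stop", None))
worker.join()
```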

This was discussed in the Special Topics meeting on January 30.

There were no objections to allowing this for DLRM. The proposer needs to make the case for this and have it agreed upon before the HParam deadline. The case should take the form of evidence that such overlapping is used in practice, e.g. papers or testimonials.

tayo avatar Jan 30 '20 18:01 tayo

Postponed until next round.

bitfort avatar Jun 11 '20 16:06 bitfort