
Introduce MPI allreduce in a new contrib project.

Open gibiansky opened this issue 7 years ago • 3 comments

This commit adds the tensorflow.contrib.mpi namespace and contrib project, which has a variety of ops that work with MPI.

The MPI system works by starting a background thread which communicates between the different processes at a regular interval and schedules asynchronous reductions. At every tick, every rank notifies rank zero of the tensors it is ready to reduce, then signals that it has finished reporting by sending an empty DONE message. Rank zero counts how many ranks are ready to reduce each tensor, and, whenever a tensor is ready to reduce (that is, every rank is ready to reduce it), rank zero issues a message to all other ranks directing them to reduce that tensor. This repeats for all the tensors that are ready to reduce, after which rank zero sends all other ranks a DONE message indicating that the tick is complete.
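The coordination step described above can be sketched as a small single-process simulation. This is only an illustration of the counting logic, not the code from this commit: the real implementation exchanges MPI messages from a background thread, and the names `coordinator_tick`, `DONE`, `ready_counts`, and the tensor names below are all hypothetical.

```python
# Hypothetical sentinel standing in for the empty DONE message.
DONE = None


def coordinator_tick(ready_counts, messages_per_rank, num_ranks):
    """Simulate one tick from rank zero's point of view.

    ready_counts: dict mapping tensor name -> number of ranks that have
        reported it as ready so far (state carried across ticks).
    messages_per_rank: messages_per_rank[r] lists the tensor names that
        rank r reports as ready during this tick.
    Returns the responses rank zero would send to the other ranks:
    the tensors to reduce, followed by DONE.
    """
    responses = []
    for rank in range(num_ranks):
        # Each rank sends its ready tensors, then DONE to end its report.
        for msg in list(messages_per_rank[rank]) + [DONE]:
            if msg is DONE:
                break
            ready_counts[msg] = ready_counts.get(msg, 0) + 1
            # A tensor may be reduced only once every rank is ready for it.
            if ready_counts[msg] == num_ranks:
                responses.append(msg)
                del ready_counts[msg]
    # DONE tells the other ranks that this tick is complete.
    responses.append(DONE)
    return responses


counts = {}
# Tick 1: all three ranks are ready for grad_a, only rank 0 for grad_b.
tick1 = coordinator_tick(counts, [["grad_a", "grad_b"], ["grad_a"], ["grad_a"]], 3)
# Tick 2: the remaining ranks catch up on grad_b.
tick2 = coordinator_tick(counts, [[], ["grad_b"], ["grad_b"]], 3)
```

In this sketch `grad_b` is held back in the first tick because only one rank has reported it, and it is released in the second tick once the other two ranks report it. That waiting behaviour is what lets ranks become ready for tensors in different orders without deadlocking the reduction.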

gibiansky · Mar 01 '17 22:03

@gibiansky - what performance or other benefits do you observe with this feature? Thanks.

mike-dubman · Mar 03 '17 09:03

@miked-mellanox We have more info and speedup comparisons in the blog post, along with performance / scaling plots.

gibiansky · Mar 07 '17 23:03

@gibiansky the link doesn't show anything right now.

yunjiangster · Apr 05 '19 20:04