
Replicate async multi-GPU Horovod training with pure TF

albertz opened this issue on Oct 14, 2022 · 0 comments

See our multi-GPU training doc: https://returnn.readthedocs.io/en/latest/advanced/multi_gpu.html

If you do not have very fast direct connections between the GPUs (NVLink, which is only available on the big professional cards), we always recommend async training, specifically something like these settings:

horovod_reduce_type = "param"
horovod_param_sync_time_diff = 100.
horovod_dataset_distribution = "random_seed_offset"

The random_seed_offset setting is about dataset shuffling: each worker's dataset gets its own random seed, and no slicing or sharding of the data is done anymore; all workers just iterate normally over the whole dataset. This part is completely independent of Horovod, and we can reuse the same logic and code for any other multi-GPU implementation.
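To make the idea concrete, here is a minimal sketch of what random_seed_offset boils down to (the names and the seed formula are placeholders for illustration, not the actual RETURNN API):

```python
import numpy as np

def iterate_epoch(data, base_seed, epoch, worker_rank):
    """Yield the whole dataset in a worker-specific shuffled order."""
    # Offsetting the seed by the worker rank gives every worker a different
    # shuffling order, but no slicing/sharding is done: every worker still
    # sees the full dataset. (Seed formula here is illustrative only.)
    rng = np.random.RandomState(base_seed + epoch * 1000 + worker_rank)
    for idx in rng.permutation(len(data)):
        yield data[idx]
```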

The other settings are pretty simple: every 100 seconds, all parameters are averaged across the workers. That's all. Otherwise the workers run completely independently.
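For reference, this is roughly what that behavior amounts to (a simplified sketch with a toy placeholder model, not the actual RETURNN/Horovod implementation):

```python
import time
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Toy placeholder model and optimizer, just for illustration.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.build((None, 20))
opt = tf.keras.optimizers.SGD(0.1)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    # Purely local update: no gradient averaging across workers.
    opt.apply_gradients(zip(grads, model.trainable_variables))

param_sync_time_diff = 100.  # seconds, as in horovod_param_sync_time_diff
last_sync = time.time()

for step in range(10000):
    train_step(tf.random.normal([32, 20]), tf.random.normal([32, 10]))
    if time.time() - last_sync >= param_sync_time_diff:
        for var in model.trainable_variables:
            # hvd.allreduce averages across all workers by default.
            var.assign(hvd.allreduce(var.read_value()))
        last_sync = time.time()
```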

Currently this is implemented via Horovod, but I think TF provides functions to do the same. tf.distribute is somewhat related, although I think it is actually higher level; I am not really sure. See #296 for a related issue: it is more generic than this one, as #296 also wants to cover other potential training schemes, whereas this issue is very specifically only about the exact async multi-GPU training logic as we currently have it implemented with Horovod.
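One possible direction, purely as an untested sketch: keep the variables local per worker and use tf.distribute's collectives only for the periodic averaging, e.g. via MultiWorkerMirroredStrategy and ReplicaContext.all_reduce. The training loop itself would stay as in the Horovod sketch above, with the hvd.allreduce line replaced by a call to average_params():

```python
import tensorflow as tf

# Needs TF_CONFIG to be set per worker (cluster spec + task index).
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Variables created *outside* strategy.scope() stay plain per-worker variables;
# nothing is mirrored or synced automatically.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # toy placeholder model
model.build((None, 20))

@tf.function
def average_params():
    def _avg():
        ctx = tf.distribute.get_replica_context()
        # All-reduce (mean) of all trainable variables across the workers.
        averaged = ctx.all_reduce(
            tf.distribute.ReduceOp.MEAN,
            [v.read_value() for v in model.trainable_variables])
        for var, value in zip(model.trainable_variables, averaged):
            var.assign(value)
    strategy.run(_avg)
```

Whether this is actually a good fit (or whether lower-level collective ops are more appropriate) is exactly what would need to be figured out here.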

Also see the wiki about distributed TF: https://github.com/rwth-i6/returnn/wiki/Distributed-TensorFlow

Why? First, just to try it out and get some experience with tf.distribute and other related TF functions. But we also had some problems with Horovod training in the past (#314, #323), and maybe this would solve them.
