Replicate async multi-GPU Horovod training with pure TF
See our multi-GPU training doc: https://returnn.readthedocs.io/en/latest/advanced/multi_gpu.html
In case you do not have very fast direct connections between the GPUs (NVLink, only available on the big professional cards), we generally recommend async training. Specifically, settings like these:
```python
horovod_reduce_type = "param"
horovod_param_sync_time_diff = 100.
horovod_dataset_distribution = "random_seed_offset"
```
The `random_seed_offset` setting is about dataset shuffling: each worker gets its own random seed, and no slicing or similar partitioning is used anymore; all workers simply iterate normally over the whole dataset. This is completely independent of Horovod, and we can reuse the same logic and code for any other multi-GPU implementation.
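To illustrate the idea, here is a minimal sketch (function names are illustrative, not the actual RETURNN implementation): every worker shuffles the whole dataset with its own seed, offset by its rank, and then just iterates over all of it.

```python
import random


def worker_epoch_order(dataset_size, base_seed, worker_rank):
    """Each worker shuffles the full index list with seed base_seed + worker_rank.

    No slicing/sharding: every worker visits every sample, just in its own order.
    """
    order = list(range(dataset_size))
    random.Random(base_seed + worker_rank).shuffle(order)
    return order


# Every worker sees all samples, only the (seeded) order differs per rank:
orders = [worker_epoch_order(10, base_seed=42, worker_rank=r) for r in range(4)]
assert all(sorted(o) == list(range(10)) for o in orders)
```

Because the order is a pure function of the seed, each worker's epoch is reproducible, and no coordination between workers is needed for the data pipeline.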
The other settings are pretty simple: every 100 seconds, all parameters are averaged across the workers. That's all; otherwise the workers run completely independently.
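The averaging step itself can be sketched in a few lines of pure Python/NumPy (a simplified model, not the Horovod implementation; names are hypothetical): each worker trains on its own, and once the sync interval has elapsed, all workers replace their parameters by the elementwise mean.

```python
import numpy as np


def maybe_average_params(worker_params, last_sync_time, now,
                         param_sync_time_diff=100.0):
    """If the sync interval has elapsed, replace every worker's params by the mean.

    worker_params: list of per-worker parameter arrays (same shape each).
    Returns the (possibly synced) params and the updated last sync time.
    """
    if now - last_sync_time < param_sync_time_diff:
        # Not yet time to sync: workers keep training independently.
        return worker_params, last_sync_time
    avg = np.mean(worker_params, axis=0)  # elementwise mean over workers
    return [avg.copy() for _ in worker_params], now


# Two workers with diverged parameters; 150s > 100s, so they get averaged:
params = [np.array([1.0, 2.0]), np.array([3.0, 6.0])]
params, t = maybe_average_params(params, last_sync_time=0.0, now=150.0)
# → both workers now hold [2.0, 4.0]
```

In the real setup the mean is computed with a collective all-reduce across processes rather than over a local list, but the arithmetic is the same.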
Currently this is implemented via Horovod, but I think TF provides functions to do the same. `tf.distribute` is somewhat related, although I think that is actually higher level; I am not really sure. See #296 for a related issue, which is more generic than this one here: #296 also wants to cover other potential training schemes, while this issue is very specifically only about the exact async multi-GPU training logic as we currently have it implemented with Horovod.
Also see the wiki about distributed TF: https://github.com/rwth-i6/returnn/wiki/Distributed-TensorFlow
Why? First, just to test it out and get some experience with `tf.distribute` or other related TF functions. But we also had some problems with Horovod training in the past (#314, #323), and maybe this solves them.