Introduce distributed computations.
What feature would you like to see added? We have great single-node parallelization for our computations; the next logical step is to have equally good multi-node computation so that we can really make the most of computing clusters.
At this stage we have to distinguish between on-node and multi-node parallelism. My suggestion would be to use a framework such as Dask to distribute tasks across nodes while keeping TensorFlow for the on-node parallelism. As things stand this would be fairly simple to implement: we just need to add Dask arrays to the generators and ask Dask to split over batches across nodes where possible. The difficulty will come with multi-GPU use on a single node, as that will require the built-in TensorFlow operations.
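A minimal sketch of that split, using `dask.distributed` locally as a stand-in for a real cluster. The Dask calls (`Client.submit`, `Client.gather`) are real API; `load_batch` and `process_batch` are hypothetical placeholders for the MDSuite generators and the on-node (TensorFlow) computation.

```python
# Sketch: distribute batches across nodes with Dask while each task keeps
# its own intra-node parallelism (which TF would provide in practice).
from dask.distributed import Client, LocalCluster

def load_batch(i):
    # hypothetical stand-in for reading one batch from the database
    return list(range(i * 10, (i + 1) * 10))

def process_batch(batch):
    # hypothetical stand-in for the per-batch computation (TF on-node)
    return sum(batch)

# LocalCluster stands in for a multi-node cluster; on a real cluster the
# Client would point at a scheduler address instead. processes=False keeps
# everything in one process for this illustration.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

# each batch becomes an independent task that the scheduler can place on
# any worker (node)
futures = [client.submit(process_batch, client.submit(load_batch, i))
           for i in range(4)]
results = client.gather(futures)
total = sum(results)

client.close()
cluster.close()
```

On a real deployment only the `Client(...)` line changes; the task graph itself is node-agnostic.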
One thought would be to use the already built-in batch/ensemble structure to handle this, following this philosophy:
- Batches can be split over nodes
- Ensembles are split within a node over cores and/or GPUs
Currently we have:
- Loop over batch
- Parallelism within a node but only on a single GPU
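The proposed two-level split can be illustrated with only the standard library, as a stand-in for the real stack: the outer executor plays the role of Dask distributing batches over nodes, and the inner loop plays the role of TF parallelising ensembles within a node. All names here are illustrative, not MDSuite API.

```python
# Two-level split sketch: batches over "nodes" (outer), ensembles within a
# node (inner). Purely illustrative; stdlib only.
from concurrent.futures import ThreadPoolExecutor

def ensemble_value(e):
    # stand-in for one ensemble computation (TF would run these over
    # cores/GPUs inside the node)
    return e * e

def process_batch(batch_id, n_ensembles=5):
    # inner level: loop over ensembles within one node; batch_id would
    # select the data slice in a real implementation
    return sum(ensemble_value(e) for e in range(n_ensembles))

# outer level: batches dispatched to separate workers ("nodes")
with ThreadPoolExecutor(max_workers=2) as outer:
    batch_results = list(outer.map(process_batch, range(4)))
```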
With this philosophy we know exactly where we need to consider distribution.
**Update:** It is also possible to distribute over GPUs on a single machine using Dask: https://developer.nvidia.com/blog/dask-tutorial-beginners-guide-to-distributed-computing-with-gpus-in-python/
I would just like to avoid Dask trying to do the CPU distribution, as TF already does that very well.
I also think Dask might be a good way to go - I'm just leaving https://www.tensorflow.org/guide/distributed_training here because in theory that would also be an option. I think we should try different strategies, benchmark them outside of MDSuite, and then make an educated choice about what works best. This might also be slightly related to #458 if the batching is modified.
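A benchmark outside MDSuite could be as simple as timing the same workload under each candidate strategy with a small harness like the one below; `run_serial` is a hypothetical placeholder workload, and the Dask/TF variants would be dropped in as further candidates.

```python
# Tiny benchmark harness sketch: best-of-N wall time plus the result, so
# correctness can be checked alongside speed when comparing strategies.
import time

def run_serial(n):
    # placeholder workload; replace with the Dask / TF variants to compare
    return sum(i * i for i in range(n))

def benchmark(fn, *args, repeats=3):
    best = float("inf")
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best, result

t_serial, r = benchmark(run_serial, 100_000)
```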
Yes, I think we should do this at the same time as #458. I hadn't considered that the distribution methods can of course also be used to split over nodes... in that case I would ideally like to work with TF if possible, but we would need a benchmark.
One issue with the TF distribution is that it is aimed at ML training based on Keras. I don't know how flexible it is for our code.
Reading through their blogs and docs, it seems you can run a distributed strategy over an arbitrary function. The docs have a section on a custom training loop with a distributed strategy, which looks like it could align well with our use case.
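A sketch of what that could look like for us, following the custom-training-loop pattern from the TF guide but with an arbitrary per-replica computation instead of Keras training (the reduction here is a hypothetical stand-in for an MDSuite batch computation):

```python
# tf.distribute with a custom (non-Keras) loop: MirroredStrategy splits
# each batch over all visible GPUs, or falls back to a single CPU replica.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

@tf.function
def per_replica_step(x):
    # stand-in for one per-batch computation; not ML training
    return tf.reduce_sum(x * x)

dataset = tf.data.Dataset.from_tensor_slices(
    tf.range(8, dtype=tf.float32)).batch(4)
# the strategy shards each batch across its replicas
dist_dataset = strategy.experimental_distribute_dataset(dataset)

total = 0.0
for batch in dist_dataset:
    per_replica = strategy.run(per_replica_step, args=(batch,))
    # combine the per-replica partial results
    total += strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica,
                             axis=None)
```

The appeal is that the same loop runs unchanged on one CPU, several GPUs, or (with `MultiWorkerMirroredStrategy`) across nodes, which is exactly what we would want to benchmark against Dask.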