
Introduce distributed computations.

Open SamTov opened this issue 3 years ago • 4 comments

What feature would you like to see added? We have great single-node parallelization for our computations; the next logical step is to have equally good multi-node computation so that we can really make the most of computing clusters.

We have to distinguish at this stage between on-node parallelism and multi-node parallelism. My suggestion would be to use a framework such as Dask to distribute tasks across nodes while keeping TensorFlow for the on-node parallelism. This would be quite simple to implement as things stand: we just need to add Dask arrays to the generators and request that the computation be distributed across nodes over batches where possible. The difficulty will come with multi-GPU use on single nodes, as that will require the built-in TensorFlow distribution operations.
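As a rough illustration of the Dask-array idea, the sketch below chunks a trajectory-like array along the time axis so that each chunk corresponds to one batch; with a distributed cluster attached, Dask can place those chunks on different nodes. The array shapes and the `on_node_op` function are hypothetical stand-ins, not MDSuite API:

```python
import numpy as np
import dask.array as da

# Hypothetical positions array of shape (atoms, timesteps, dims), chunked
# along the time axis so each chunk is one "batch". With a distributed
# scheduler attached, Dask can place each chunk on a different node.
positions = np.random.rand(8, 100, 3)
batched = da.from_array(positions, chunks=(8, 25, 3))  # 4 time batches

def on_node_op(block):
    # Stand-in for the per-batch TensorFlow computation that would run
    # on a single node; here just an element-wise square.
    return block ** 2

# Default local scheduler here; a dask.distributed client would instead
# farm the blocks out across the cluster.
result = batched.map_blocks(on_node_op).compute()
```

The key point is that the batching structure the generators already produce maps directly onto Dask's chunking, so the per-batch computation itself does not need to change.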

One thought would be to use the already built-in batch/ensemble structure to handle this by considering the following philosophy:

  1. Batches can be split over nodes
  2. Ensembles are split within a node over cores and/or GPUs
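The two-level split above could be sketched as follows. `dask.delayed` handles the outer batch loop (step 1), while the inner ensemble loop (step 2) is a placeholder for the TensorFlow work that would run within a node; `analyse_batch` and the array shapes are illustrative assumptions, not MDSuite code:

```python
import numpy as np
import dask

def analyse_batch(batch):
    # Step 2: within a batch, split into ensembles. In MDSuite this inner
    # loop would be TensorFlow ops parallelised over cores/GPUs on the node.
    ensembles = np.array_split(batch, 5)
    return sum(float((e ** 2).sum()) for e in ensembles)

data = np.arange(100.0)

# Step 1: batches split over nodes. dask.delayed builds one task per batch;
# with a dask.distributed client these tasks land on different nodes, while
# the local scheduler used here runs them in threads.
batches = np.array_split(data, 4)
tasks = [dask.delayed(analyse_batch)(b) for b in batches]
totals = dask.compute(*tasks)
```

Because each batch task is independent, the outer loop needs no communication between nodes until the per-batch results are reduced at the end.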

Currently we have:

  1. Loop over batch
  2. Parallelism within a node but only on a single GPU

With this philosophy we would simply change where we need to consider distribution.

Update: It is also possible to distribute over GPUs on a single machine using Dask: https://developer.nvidia.com/blog/dask-tutorial-beginners-guide-to-distributed-computing-with-gpus-in-python/

I would just like to avoid Dask trying to do the CPU distribution, as TF already does that very well.
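One way to keep Dask out of the CPU parallelism is to run a single, single-threaded Dask worker per node, so TensorFlow keeps all the cores for itself. A minimal in-process sketch, assuming `dask.distributed` is available (on a real cluster this would be one `dask-worker` process per node rather than a `LocalCluster`):

```python
from dask.distributed import Client, LocalCluster

# One worker, one thread: Dask only schedules and moves tasks between
# nodes, while the task body (TensorFlow in MDSuite) owns the node's cores.
cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=False)
client = Client(cluster)

def per_batch(x):
    # Placeholder for the TF computation that saturates the node's CPUs.
    return x * x

futures = client.map(per_batch, range(4))
results = client.gather(futures)

client.close()
cluster.close()
```

With this layout, Dask's concurrency is purely inter-node, and TF's thread pools are never oversubscribed by a second scheduler.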

SamTov avatar Jan 14 '22 10:01 SamTov