
Add support for Horovod (tensorflow)

Open pescap opened this issue 2 years ago • 4 comments

Work in progress. Add hvd support to use multiple GPUs. It seems to be working properly (tested with 1 to 8 GPUs) and is non-intrusive! Later, we could also add hvd acceleration for the tf.compat.v1 and pytorch backends.

To do:

  • [ ] Avoid multiple imports of import horovod.tensorflow as hvd. I have to understand how to import horovod only once, and where to put this import. I do not want the code to depend on hvd when one does not use multi-GPU acceleration.
  • [ ] Define the collocation points properly. The idea would be to generate independent samples for each GPU. This is not done properly so far: with the current code, I am sending the same points to all the GPUs, and I add a resampling process resampler = dde.callbacks.PDEResidualResampler(period=epochs // hvd.size()). In my opinion, the simplest option would be to set the mini_batch option properly (see #320)... Do you agree?

Tested with the 2D Helmholtz case (notice that I only added the resampler and dde.config.set_hvd()):

"""Backend supported: tensorflow.compat.v1, tensorflow, pytorch"""
import deepxde as dde
import numpy as np

dde.config.set_hvd()

# Set the random seed to 11
dde.config.set_random_seed(11)
# General parameters
n = 2
precision_train = 10
precision_test = 30
hard_constraint = True
weights = 100  # if hard_constraint == False
epochs = 5000
parameters = [1e-3, 3, 150, "sin"]

# Define sine function
if dde.backend.backend_name == "pytorch":
    sin = dde.backend.pytorch.sin
else:
    from deepxde.backend import tf

    sin = tf.sin

learning_rate, num_dense_layers, num_dense_nodes, activation = parameters


def pde(x, y):
    dy_xx = dde.grad.hessian(y, x, i=0, j=0)
    dy_yy = dde.grad.hessian(y, x, i=1, j=1)

    f = k0 ** 2 * sin(k0 * x[:, 0:1]) * sin(k0 * x[:, 1:2])
    return -dy_xx - dy_yy - k0 ** 2 * y - f


def func(x):
    return np.sin(k0 * x[:, 0:1]) * np.sin(k0 * x[:, 1:2])


def transform(x, y):
    res = x[:, 0:1] * (1 - x[:, 0:1]) * x[:, 1:2] * (1 - x[:, 1:2])
    return res * y


def boundary(_, on_boundary):
    return on_boundary


geom = dde.geometry.Rectangle([0, 0], [1, 1])
k0 = 2 * np.pi * n
wave_len = 1 / n

hx_train = wave_len / precision_train
nx_train = int(1 / hx_train)

hx_test = wave_len / precision_test
nx_test = int(1 / hx_test)

if hard_constraint == True:
    bc = []
else:
    bc = dde.icbc.DirichletBC(geom, lambda x: 0, boundary)


data = dde.data.PDE(
    geom,
    pde,
    bc,
    num_domain=nx_train ** 2,
    num_boundary=4 * nx_train,
    solution=func,
    num_test=nx_test ** 2,
)

net = dde.nn.FNN(
    [2] + [num_dense_nodes] * num_dense_layers + [1], activation, "Glorot uniform"
)

if hard_constraint == True:
    net.apply_output_transform(transform)

model = dde.Model(data, net)

if hard_constraint == True:
    model.compile("adam", lr=learning_rate, metrics=["l2 relative error"])
else:
    loss_weights = [1, weights]
    model.compile(
        "adam",
        lr=learning_rate,
        metrics=["l2 relative error"],
        loss_weights=loss_weights,
    )
    
import horovod.tensorflow as hvd

resampler = dde.callbacks.PDEResidualResampler(period=epochs // hvd.size())
losshistory, train_state = model.train(epochs=epochs, callbacks=[resampler])
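
(For reference, such a script is typically launched with Horovod's launcher, e.g. horovodrun -np 8 python helmholtz_hvd.py, one process per GPU; the script name here is just a placeholder.)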

pescap avatar May 25 '22 14:05 pescap

#579 #39

pescap avatar May 25 '22 14:05 pescap

Great. Supporting multi-GPU training is nontrivial. Currently I don't have a good solution for the collocation points. What is your idea of "mini_batch"?

lululxvi avatar May 25 '22 18:05 lululxvi

FYI: https://github.com/horovod/horovod/issues/114

I am currently working on it. The first milestone would be to distribute the training points over the GPUs, maybe via a data generator. As a starting point, one could partition the domain points over the GPUs. For now, the bc points would be the same for all GPUs.

Example for the 2D Helmholtz case with hard_constraint (for the sake of simplicity): data.train_points().shape = (480, 2)

We have 480 points. We could partition them over the GPUs, so that each GPU computes its gradients over 480 // hvd.size() points.

For example, with 2 GPUs, each one gets 240 domain points. This partitioning is very close to the mini_batch idea.
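
A rough sketch of this partitioning (the array below is only a stand-in for data.train_points(); none of this is existing DeepXDE API):

import numpy as np
import horovod.tensorflow as hvd

hvd.init()

# Stand-in for data.train_points(): the (480, 2) domain collocation points.
X_domain = np.random.rand(480, 2)

# Give each rank (GPU) a contiguous, disjoint slice of the points.
n_per_rank = X_domain.shape[0] // hvd.size()
start = hvd.rank() * n_per_rank
X_local = X_domain[start : start + n_per_rank]  # 240 points per GPU when hvd.size() == 2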

I was thinking about defining the GPU-dependent data here: https://github.com/pescap/deepxde/blob/80d2b60e4ee2de823c48dedaf7ed4bfb1f362d40/deepxde/model.py#L551

Next, as soon as the Horovod architecture is in place, one could move towards domain decomposition-based solutions.

Refs: https://github.com/horovod/horovod/blob/a0cd0af215c4396033f0e0fadefddf585a2b079a/examples/tensorflow2/tensorflow2_mnist.py#L86

pescap avatar May 25 '22 21:05 pescap

Major issue:

So far, it seems that the weights are not properly shared across all GPUs. This is a pending issue.

In the current state, I expect random data to be generated independently on each GPU. I am following the simplest implementation, which effectively trains the model over N_GPU * data.
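
For reference, the generic Horovod recipe for keeping the weights consistent is to average gradients with a distributed tape and broadcast the variables from rank 0. A minimal, self-contained sketch with a toy Keras model (not the DeepXDE integration):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Toy network standing in for the DeepXDE model; only the Horovod calls matter here.
model = tf.keras.Sequential([tf.keras.Input(shape=(2,)), tf.keras.layers.Dense(1)])
opt = tf.keras.optimizers.Adam(1e-3)

x = tf.random.uniform((32, 2))
y_true = tf.zeros((32, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y_true))

# Average the gradients across all GPUs before applying them.
tape = hvd.DistributedGradientTape(tape)
grads = tape.gradient(loss, model.trainable_variables)
opt.apply_gradients(zip(grads, model.trainable_variables))

# Broadcast the weights from rank 0 so every GPU continues from the same state
# (the official Horovod examples also broadcast the optimizer variables).
hvd.broadcast_variables(model.variables, root_rank=0)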

Additional temporary notes:

Train next batch for PDEs: https://github.com/lululxvi/deepxde/blob/770e7c1f703682633fe182a6de984987fd579afa/deepxde/data/pde.py#L168

Discussion concerning mini-batch, and resampling: https://github.com/lululxvi/deepxde/issues/175

https://github.com/lululxvi/deepxde/issues/39 "During training, we call train_next_batch for each SGD iteration. Currently, I just ignore batch_size, and return the whole dataset."

pescap avatar May 26 '22 13:05 pescap

I'll start from the beginning, as this PR is too outdated. A quick update:

  • I am more familiar with tf.compat.v1 and it has more models implemented. Therefore, I will implement the Horovod acceleration for this backend first (and not tensorflow, as in this original draft).
  • In the end, the data point generation is straightforward. One uses random or pseudo-random generation, and each GPU is assigned its own data. A uniform distribution would lead to the same training points on all GPUs... (a sketch follows this list).
  • I keep accelerating only the domain collocation points; I assume that the boundary points do not need to be distributed across GPUs.
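
A minimal sketch of the per-GPU sampling idea (the seed offset and sizes are illustrative assumptions, not DeepXDE API):

import numpy as np
import horovod.tensorflow as hvd

hvd.init()

# Offset the seed by the rank so each GPU draws independent collocation points;
# a shared seed (or a deterministic/uniform point set) would give every GPU the
# same training data.
rng = np.random.default_rng(seed=11 + hvd.rank())
X_local = rng.random((480 // hvd.size(), 2))  # this rank's domain points in [0, 1]^2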

pescap avatar Mar 24 '23 20:03 pescap

@pescap You can also check https://github.com/lululxvi/deepxde/pull/1094, although it is only for backend paddle, but it has many useful functions.

lululxvi avatar Mar 24 '23 20:03 lululxvi

@pescap You can also check #1094, although it is only for backend paddle, but it has many useful functions.

Thank you, I'll try to follow the same structure.

pescap avatar Mar 24 '23 20:03 pescap