
Satellite Unet in multi-gpu

Open LJ-20 opened this issue 5 years ago • 13 comments

Hello, I wasn't able to run the Satellite Unet on multi-GPU. I didn't have this problem with the custom unet.

LJ-20 avatar Feb 12 '20 22:02 LJ-20

@LJ-20 can you share your experience? I cannot run the custom unet with multi-GPU. I followed the distributed training part of the TensorFlow documentation, but no luck. It seems I need to refactor the code and use a custom distributed training loop (namely strategy.experimental_distribute_dataset).
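For reference, a custom distributed training loop built around experimental_distribute_dataset might look roughly like this (a minimal TF 2.x sketch with a toy model and dummy data, not keras_unet-specific; on TF 2.0/2.1 the step call was strategy.experimental_run_v2 instead of strategy.run):

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH_SIZE = 8

# Dummy images/masks standing in for real training data.
images = np.random.rand(32, 64, 64, 3).astype(np.float32)
masks = np.random.randint(0, 2, (32, 64, 64, 1)).astype(np.float32)
dataset = tf.data.Dataset.from_tensor_slices((images, masks)).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    # Toy segmentation model; a real run would build the UNet here instead.
    inputs = tf.keras.Input(shape=(64, 64, 3))
    outputs = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(inputs)
    model = tf.keras.Model(inputs, outputs)
    optimizer = tf.keras.optimizers.Adam()
    loss_obj = tf.keras.losses.BinaryCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)

def compute_loss(labels, preds):
    # Mean over the spatial dims per example, then scale by the global batch size.
    per_example_loss = tf.reduce_mean(loss_obj(labels, preds), axis=[1, 2])
    return tf.nn.compute_average_loss(per_example_loss,
                                      global_batch_size=GLOBAL_BATCH_SIZE)

@tf.function
def train_step(inputs):
    x, y = inputs
    with tf.GradientTape() as tape:
        preds = model(x, training=True)
        loss = compute_loss(y, preds)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

for batch in dist_dataset:
    per_replica_loss = strategy.run(train_step, args=(batch,))
    print(strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None))
```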

muminoff avatar Feb 13 '20 03:02 muminoff

Hi @LJ-20. Honestly, I haven't tested any of the UNet implementations from this repo on a multi-GPU setup, but in theory there shouldn't be any issues. You said you were able to run the custom unet on multi GPU but it's not working for the satellite unet, which I find weird because there's no significant difference in implementation or dependencies between custom and satellite. My educated guess would be that you either had some errors in your code or problems with allocating resources on the GPU. Can you share the code you used, your TF/Keras versions, and the error message? That way I might be able to help you out, or at least investigate it. Same goes for you @muminoff: share the same information and I'll look into it. Thanks, Karol

karolzak avatar Feb 13 '20 18:02 karolzak

These are the lines for each of the models:

model = satellite_unet(input_shape=(256, 256, 3))

model = custom_unet(
    (256, 256, 3),
    num_classes=1,
    use_batch_norm=True,
    upsample_mode='deconv',
    use_dropout_on_upsampling=False,
    dropout=0.0,
    dropout_change_per_layer=0.0,
    filters=64,
    num_layers=4,
    output_activation='sigmoid'
)

with the command:

model = multi_gpu_model(model, gpus=4, cpu_relocation=True)

The implementation followed the TensorFlow documentation: https://www.tensorflow.org/api_docs/python/tf/keras/utils/multi_gpu_model?version=stable

The error was the following: the accuracy would drop to 0.0e+00 and the loss would stay constant at 0.3 after the first epoch. It would never improve, even after one day of training. Also, this was not an issue when using 1 GPU or the CPU. I used both TensorFlow 1.14 and 1.15.

LJ-20 avatar Feb 13 '20 21:02 LJ-20

@karolzak I haven't tried tf.keras.utils.multi_gpu_model since it is deprecated. But, I tried with tf.distribute.MirroredStrategy().

And, here is my code:

import tensorflow as tf

from keras_unet.models import custom_unet
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam, SGD
from keras_unet.metrics import iou, iou_thresholded
from keras_unet.losses import jaccard_distance

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():

    input_shape = x_train[0].shape

    model = custom_unet(
        input_shape,
        filters=32,
        use_batch_norm=True,
        dropout=0.3,
        dropout_change_per_layer=0.0,
        num_layers=6
    )

    model.summary()

    model_filename = 'model-v2.h5'

    callback_checkpoint = ModelCheckpoint(
        model_filename, 
        verbose=1, 
        monitor='val_loss', 
        save_best_only=True,
    )

    model.compile(
        optimizer=Adam(), 
        #optimizer=SGD(lr=0.01, momentum=0.99),
        loss='binary_crossentropy',
        #loss=jaccard_distance,
        metrics=[iou, iou_thresholded]
    )

    history = model.fit_generator(
        train_gen,
        steps_per_epoch=200,
        epochs=50,
        validation_data=(x_val, y_val),
        callbacks=[callback_checkpoint]
    )

Error:

ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.

muminoff avatar Feb 14 '20 00:02 muminoff

@karolzak FYI, using multi_gpu_model raises the following exception:

ValueError: ('Expected `model` argument to be a `Model` instance, got ', <keras.engine.training.Model object at 0x7f1b347372d0>)

Edit: format

muminoff avatar Feb 14 '20 01:02 muminoff


@LJ-20 So it's not that you're unable to run the satellite unet on multi GPU at all, it's rather that it doesn't converge in a multi-GPU setup.. Hmm, very interesting, thank you for bringing that up! I will look into it and try debugging it, however as of right now I don't see anything that could be causing this from the model implementation perspective. My best guess would be that there's something wrong with your input data. Check the following:

  • make sure your input data pixel values are in the 0-1 range
  • make sure your input dtype is set to float32 (see the sketch right after this list for a quick check)
  • as a last resort, try playing around with multi_gpu_model's params (cpu_merge, etc.) and see if that helps
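A quick way to verify and fix the first two points might be something like this (a minimal sketch; x_train here is just a placeholder for your own array):

```python
import numpy as np

# Placeholder standing in for your loaded image array.
x_train = np.random.randint(0, 256, size=(4, 256, 256, 3)).astype(np.float64)

print(x_train.dtype, x_train.min(), x_train.max())

# Cast to single precision and scale 0-255 pixel values into the 0-1 range.
x_train = x_train.astype(np.float32)
if x_train.max() > 1.0:
    x_train = x_train / 255.0
```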

Let me know how it goes!

EDIT: Removed part related to another topic in this discussion which was moved to issue #14

karolzak avatar Feb 14 '20 02:02 karolzak

@karolzak

> @muminoff, can you specify the version that you're using for TF/Keras? This seems to be related to that problem.

tf.__version__
'2.1.0'

keras.__version__
'2.3.1'

> Hmm.. looking at your example I see you're using fit/fit_generator inside the scope of with strategy.scope(): - I'm not sure, but this might be the root of your problem. In all the examples I've seen, only building and compiling the model happens inside that scope, whereas the actual training is outside of it. Please try fixing the indent and let me know if that helps?
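For reference, the structure suggested in the quote above would look roughly like this (a generic tf.keras sketch with dummy data, not keras_unet-specific, so it sidesteps the Keras/tf.keras mismatch discussed further down):

```python
import numpy as np
import tensorflow as tf

# Dummy data just to keep the sketch self-contained.
x = np.random.rand(8, 64, 64, 3).astype(np.float32)
y = np.random.randint(0, 2, (8, 64, 64, 1)).astype(np.float32)

strategy = tf.distribute.MirroredStrategy()

# Only model creation and compilation go inside the strategy scope.
with strategy.scope():
    inputs = tf.keras.Input(shape=(64, 64, 3))
    features = tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu")(inputs)
    outputs = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(features)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")

# The training call itself sits outside the scope.
model.fit(x, y, batch_size=4, epochs=1, validation_data=(x, y))
```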

Fixed the indent, but same error:

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():

    input_shape = x_train[0].shape

    model = custom_unet(
        input_shape,
        filters=32,
        use_batch_norm=True,
        dropout=0.3,
        dropout_change_per_layer=0.0,
        num_layers=6
    )

model.summary()

model_filename = 'model-v2.h5'

callback_checkpoint = ModelCheckpoint(
    model_filename, 
    verbose=1, 
    monitor='val_loss', 
    save_best_only=True,
)

model.compile(
    optimizer=Adam(), 
    #optimizer=SGD(lr=0.01, momentum=0.99),
    loss='binary_crossentropy',
    #loss=jaccard_distance,
    metrics=[iou, iou_thresholded]
)

history = model.fit_generator(
    train_gen,
    steps_per_epoch=200,
    epochs=50,
    validation_data=(x_val, y_val),
    callbacks=[callback_checkpoint]
)

Error:

ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call.

muminoff avatar Feb 14 '20 02:02 muminoff

@karolzak I tried the different approaches you described and left only building and compiling the model inside the strategy context; same error. The snippet shown above is the latest attempt. You can assume that compiling the model inside the context also raises that exception.

muminoff avatar Feb 14 '20 02:02 muminoff

@muminoff I was able to reproduce your problem and debug it, and it is related to a Keras/tf.keras dependency mismatch. I will introduce a fix for this problem in the next PR, but could you please create a separate issue for your problem, as it is not directly related to @LJ-20's issue? Feel free to copy the content of your comments from this issue into the new one. Thanks!
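For anyone hitting the same error later, the mismatch is visible directly on the model object. An illustrative check, assuming both the standalone keras package and TF are installed as in the versions quoted above:

```python
import keras
import tensorflow as tf
from keras_unet.models import custom_unet

model = custom_unet((256, 256, 3))

# If keras_unet builds the model with standalone Keras, tf.keras utilities
# such as multi_gpu_model reject it, which matches the ValueError above.
print(isinstance(model, keras.Model))     # True when built on standalone Keras
print(isinstance(model, tf.keras.Model))  # False in that case
```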

karolzak avatar Feb 14 '20 03:02 karolzak

@karolzak the pixel range is read from 0 to 1, the numpy arrays are dtype float32, and I also tried the multi_gpu_model parameters. Our first thought was batch normalization or the way the weights are merged in multi_gpu_model, but we didn't have this problem with custom_unet using the exact same code.

LJ-20 avatar Feb 18 '20 17:02 LJ-20

Update: upon revision, it seems the problem was the dtype after all. I had it set to float64 instead of float32. Why is this an issue?

LJ-20 avatar Feb 18 '20 19:02 LJ-20

@LJ-20, so you used float64 for both custom_unet and satellite_unet? Or just for satellite_unet? In general, single precision (float32) is the most commonly used (it's also the default for TF, and maybe that's the root of the problem?) and I haven't seen examples of float64 being used in any experiments. Half precision (float16), on the other hand, can be used in cases where you want to squeeze more data into memory.
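As a quick check of that default, something like this would show it (a minimal sketch; whether float64 alone explains the satellite_unet behaviour is still the open question here):

```python
import numpy as np
import tensorflow as tf

# Keras computes in single precision by default.
print(tf.keras.backend.floatx())  # 'float32'

x = np.random.rand(4, 256, 256, 3)        # NumPy defaults to float64
x = x.astype(tf.keras.backend.floatx())   # cast down to the framework default
print(x.dtype)                            # float32
```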

karolzak avatar Feb 18 '20 19:02 karolzak

I used float64 for both custom_unet and satellite_unet, and it only worked with custom_unet.

LJ-20 avatar Feb 18 '20 21:02 LJ-20