Multi-GPU training
This needs a bit more testing, but I think going multi-GPU is somewhat straightforward. Or did you try that already?
Almost there; apparently there is a problematic interplay between tf and keras: https://github.com/tensorflow/tensorflow/issues/30728 https://github.com/keras-team/keras/issues/13057 https://github.com/keras-team/keras/pull/13255 I need to check how to fix this.
Done implementing multi-GPU training. I hope putting that into the constructor of N2V was the right choice. I also added an example notebook, examples/2D/denoising2D_BSD68/BSD68_reproducibility_multi_gpu.ipynb.
I'll supply more extensive numbers later; my current estimate for training n2v from this notebook is:
- a single P100 with tf 1.12 and keras 2.2.4: ~93 seconds per epoch after warm-up
- double P100s with tf 1.12 and keras 2.2.4: ~56 seconds per epoch after warm-up
I'll provide 4-GPU numbers later. Note that this "improvement" is expected to be non-linear, as keras internally parallelizes over the batch dimension: a batch size of 128 will be split into 2 batches of 64 images. As discussed earlier, this approach is currently not supported with tf 1.14 and keras 2.2.{4,5} due to the bugs mentioned above.
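To make the batch-splitting behaviour above concrete, here is a small illustrative helper (`split_batch` is hypothetical; it is not part of n2v or keras, it only mirrors the arithmetic keras performs when distributing a global batch over replicas):

```python
# Hypothetical illustration of how a global batch is divided across GPUs.
def split_batch(batch_size, n_gpus):
    """Return the per-GPU batch sizes for a given global batch size."""
    base = batch_size // n_gpus
    sizes = [base] * n_gpus
    # Any remainder is spread over the first replicas.
    for i in range(batch_size % n_gpus):
        sizes[i] += 1
    return sizes

print(split_batch(128, 2))  # -> [64, 64]
```

This is why speed-ups are sub-linear: each GPU sees a smaller effective batch, so per-batch overhead does not shrink proportionally.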
Would love to hear your feedback on this.
Thank you for this PR!
I have this on my to-do list, but wasn't able to get my hands on a multi-GPU system. I guess the cluster should work for testing.
Although I am very confident that it just works, I would like to test it as well :)
Thanks for having a look. Last time I checked, all GPU configs with >=3 GPUs fail to run due to some problems with the keras data augmentations. Maybe this could be alleviated by looking into bringing n2v 100% to tf.keras?
Hi, I want to use 2 GPUs for training. As explained in the notebook, I used the following config:

```python
config = N2VConfig(X_train, unet_kern_size=3, unet_n_depth=3, unet_n_first=64,
                   train_steps_per_epoch=int(dim[0] / 128), train_epochs=50, train_loss='mse',
                   batch_norm=True, train_num_gpus=2,
                   train_batch_size=64, n2v_perc_pix=1.0, n2v_patch_shape=(128, 128),
                   n2v_manipulator='uniform_withCP', n2v_neighborhood_radius=5)
```
I have set CUDA_VISIBLE_DEVICES to 1,2 before running the training. I installed N2V with pip install n2v. My tensorflow-gpu is 1.14.1, keras 2.2.5, numpy 1.19.1.
The training still uses only 1 GPU. Please let me know what I am missing.
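One thing worth double-checking (an assumption about the setup above, not a confirmed diagnosis): CUDA_VISIBLE_DEVICES only takes effect if it is set before TensorFlow initializes its GPU context. Setting it from within Python would look like:

```python
import os

# Restrict the process to GPUs 1 and 2. This must happen before the first
# `import tensorflow` (or keras), otherwise the setting is silently ignored.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

# import tensorflow as tf  # only import TF after the variable is set
```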
Hi @piby2,
This functionality is not part of the official N2V release yet.
If you would like to test it, you would have to clone the fork psteinb/n2v and check out the branch multi_gpu_training. Then you can run pip install . from inside the git repo and this version will be installed.
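For reference, the full sequence (assuming the fork and branch names given above) would be:

```shell
# Clone the fork, switch to the multi-GPU branch, and install from source.
git clone https://github.com/psteinb/n2v.git
cd n2v
git checkout multi_gpu_training
pip install .
```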