Epoch time doubles when running separate tasks on different GPUs
Hi,
I am running the winning network of the BraTS 2021 challenge (nnUNet with tweaks) on a shared Windows machine with 4 Tesla V100 GPUs and 256GB RAM. The code can be found here:
https://github.com/rixez/Brats21_KAIST_MRI_Lab
This is the running command that I am using:
3d_fullres nnUNetTrainerV2BraTSRegions_DA4_BN_BD_largeUnet_Groupnorm 1 0 --npz
When I run a single training job on a single GPU, the epoch time is approximately 450 seconds. When I run a different task on another GPU at the same time, the epoch time almost doubles, to approximately 800 seconds. It seems like the two runs are sharing a resource, probably the CPU; there is plenty of RAM on the machine, so that shouldn't be the issue.
Any help will be greatly appreciated. Thank you very much in advance.
Same behavior here on a DGX Station with four A100s, 512GB RAM, and a 128-thread AMD EPYC processor. However, I am using the standard version of nnUNet.
It is very strange. I still haven't been able to figure it out. I also noticed that when another person runs a completely different job, not related to nnUNet, on another GPU of the same machine, it has the same effect.
Guys, you are using expensive machines, so you should also know how to monitor their resource usage. If the CPU is fully loaded then it is fully loaded, and there is unfortunately not much to be done about it.
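One way to check this (generic Linux monitoring tools, not part of the original reply; on Windows the Task Manager serves the same purpose):

htop                # per-core CPU load; all cores pinned near 100% points to a CPU bottleneck
nvidia-smi -l 5     # GPU utilization refreshed every 5 s; GPUs sitting idle during training usually means the data pipeline cannot keep up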
If the CPU is not full and you have RAM available, you can use nnUNet_n_proc_DA to increase the number of data augmentation workers (data augmentation is typically the bottleneck for BraTS due to the large number of modalities).
On our A100 servers I like to use nnUNet_n_proc_DA=28 or 32 (if the CPU allows it). On V100s I would recommend 24 or so.
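For example (a minimal sketch: nnUNet_n_proc_DA is read as an environment variable by nnUNet v1, and nnUNet_train is the standard v1 entry point, which this fork is assumed to use):

export nnUNet_n_proc_DA=24    # Linux/macOS; in a Windows cmd shell use: set nnUNet_n_proc_DA=24
nnUNet_train 3d_fullres nnUNetTrainerV2BraTSRegions_DA4_BN_BD_largeUnet_Groupnorm 1 0 --npz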
Best,
Fabian
Also read and run this: https://github.com/MIC-DKFZ/nnUNet/blob/a4af4c05be482b764131d38fae5308ace63eb39f/documentation/expected_epoch_times.md#results
Thank you very much, Fabian, for your advice and this amazing code. I will make sure to apply your suggestion the next time I train a model, and I will post an update once I do.