
Epoch time doubles when running separate tasks on different GPUs

Open bucsab12 opened this issue 3 years ago • 5 comments

Hi,

I am running the winning network of the BraTS 2021 challenge (nnUNet with tweaks) on a shared Windows machine with 4 Tesla V100 GPUs and 256 GB RAM. The code can be found here:

https://github.com/rixez/Brats21_KAIST_MRI_Lab

This is the command I am running:

3d_fullres nnUNetTrainerV2BraTSRegions_DA4_BN_BD_largeUnet_Groupnorm 1 0 --npz

When I run a single training on a single GPU, the epoch time is approximately 450 seconds. When I run a second, separate task on another GPU, the epoch time nearly doubles to approximately 800 seconds. It seems like the two runs are competing for a shared resource, most likely the CPU; RAM should not be the issue, since the machine has plenty available.
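
For reference, a common way to pin each run to its own GPU and to confirm the runs are not landing on the same device is to set CUDA_VISIBLE_DEVICES before launching and to watch utilization with nvidia-smi (Windows cmd syntax; the GPU indices are just examples):

set CUDA_VISIBLE_DEVICES=0    (first shell, then launch the training command above)
set CUDA_VISIBLE_DEVICES=1    (second shell, then launch the second training)
nvidia-smi -l 5               (refresh GPU utilization and memory every 5 seconds)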

Any help will be greatly appreciated. Thank you very much in advance.

bucsab12 avatar Jun 26 '22 06:06 bucsab12

Same behavior here on a DGX Station with four A100s, 512 GB RAM, and a 128-thread AMD EPYC processor. However, I am using the standard version of nnUNet.

Nanex101195 avatar Jun 26 '22 21:06 Nanex101195

It is very strange; I still haven't been able to figure it out. I also noticed that when another person runs a completely different job, not related to nnUNet, on another GPU of the same machine, it has the same effect.

bucsab12 avatar Jun 26 '22 22:06 bucsab12

Guys, you are using expensive machines, so you should also know how to monitor their resource usage. If the CPU is full, it is full, and unfortunately there is not much to be done about it. If the CPU is not full and you have RAM available, you can use nnUNet_n_proc_DA to increase the number of data augmentation workers (data augmentation is typically the bottleneck for BraTS due to the large number of modalities). On our A100 servers I like to use nnUNet_n_proc_DA=28 or 32 (if the CPU allows it). On V100 I would recommend 24 or so. Best, Fabian
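
For example, a minimal way to set this for a single session before launching your training command (pick a number that matches your free cores):

set nnUNet_n_proc_DA=24       (Windows cmd)
export nnUNet_n_proc_DA=24    (Linux/macOS)

The variable has to be set in the same shell (or the system environment) that the training is started from.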

FabianIsensee avatar Aug 23 '22 10:08 FabianIsensee

Also read and run this: https://github.com/MIC-DKFZ/nnUNet/blob/a4af4c05be482b764131d38fae5308ace63eb39f/documentation/expected_epoch_times.md#results
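
As a quick way to compare runs: the trainer writes each epoch's duration to the training log in the fold's output folder, so you can pull the timings out directly (a sketch; the exact log file name and message wording may differ between versions):

findstr /C:"epoch took" training_log_*.txt    (Windows)
grep "epoch took" training_log_*.txt          (Linux)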

FabianIsensee avatar Aug 23 '22 10:08 FabianIsensee

Thank you very much, Fabian, for your advice and this amazing code. I will apply your suggestion the next time I train a model and will post an update once I do.

bucsab12 avatar Aug 24 '22 06:08 bucsab12