
Slow epoch time and intermittent GPU usage for 9-channel 3D images

Open · ghezzis opened this issue 2 years ago · 1 comment

Hello Fabian and all collaborators of this project 😀

First of all, I would like to thank you for this amazing framework.

Secondly, I would like to ask for help, since I am experiencing slow training and would like to understand why.

Unfortunately, I am still using v1. We started before v2 was released, and since our trained models could not be reused with v2, we kept the old version. Therefore, I cannot use the benchmark command to find the bottleneck. Installing v2 (even in a separate env) breaks all my running v1 trainings and seems a bit tricky at the moment.

I am running two trainings in parallel on two GPUs (using CUDA_VISIBLE_DEVICES=0 and 1) on 3D biomedical images with 9 channels. When training in parallel, both GPUs are active (as seen in nvidia-smi) and the epoch time is about 380-400 seconds. When training on a single GPU, the epoch time is about 280 seconds. Watching nvidia-smi, the GPU seems to work "intermittently": it is busy for a while, then idles for a long stretch, then becomes busy again...
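To make the pattern more concrete, a small snippet like the following (my own helper, not part of nnU-Net) could log the utilization once per second; long stretches near 0% during an epoch would point to the data pipeline rather than the GPU itself:

```python
# Hypothetical helper (not part of nnU-Net): sample GPU utilization once per
# second via nvidia-smi so the "intermittent" pattern can be quantified.
import subprocess
import time

def log_gpu_utilization(gpu_id: int = 0, interval_s: float = 1.0, samples: int = 300):
    """Print one utilization/memory reading per interval for the given GPU."""
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi", f"--id={gpu_id}",
             "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        util, mem = out.stdout.strip().split(", ")
        print(f"{time.strftime('%H:%M:%S')}  GPU{gpu_id}  util={util}%  mem={mem} MiB")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_gpu_utilization(gpu_id=0)
```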

Some time ago we trained on 3D images with the same dimensions but only one channel, and this issue was much less evident. I am wondering what the problem might be.

I was wondering whether the CPUs might simply not be enough for the 9 channels we are using. Below is the output of the lscpu command (attached as a screenshot).
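As a rough sanity check (my own snippet, nothing from nnU-Net), I compare the core count with the number of augmentation workers the two parallel trainings would spawn; the fallback of 12 workers is an assumption on my side:

```python
# Quick sanity check: if the two parallel trainings need more worker processes
# than there are cores, the data-augmentation workers compete for CPU and the
# GPUs end up waiting for batches.
import os

logical_cores = os.cpu_count()
n_proc_da = int(os.environ.get("nnUNet_n_proc_DA", "12"))  # 12 is an assumed fallback
n_parallel_trainings = 2

workers_needed = n_parallel_trainings * (n_proc_da + 1)  # +1 for each main process
print(f"logical cores available : {logical_cores}")
print(f"worker processes needed : {workers_needed}")
if workers_needed > logical_cores:
    print("-> likely CPU-bound: reduce nnUNet_n_proc_DA or train one job at a time")
```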

Another possibility might be slow data loading. Unfortunately, I am running the training inside a Proxmox container and have no visibility of the server it is mounted on. Output of "lsblk -d" (attached as a screenshot). I read somewhere that, inside a Proxmox container, the rotational flag may not be reported correctly.
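To rule the storage in or out, I could time sequential reads of the preprocessed files with something like this sketch (the task name is a placeholder, adjust to your nnUNet_preprocessed layout):

```python
# Rough read-throughput estimate (hypothetical, not part of nnU-Net): read a
# subset of the preprocessed .npy/.npz files sequentially and report MB/s.
# A few MB/s would point at slow storage inside the container; several hundred
# MB/s would largely rule the disk out.
import glob
import os
import time

preprocessed_dir = os.path.join(
    os.environ["nnUNet_preprocessed"], "Task501_MyTask"  # hypothetical task name
)
files = glob.glob(os.path.join(preprocessed_dir, "**", "*.np[yz]"), recursive=True)
if not files:
    raise SystemExit(f"no preprocessed files found under {preprocessed_dir}")

total_bytes = 0
start = time.perf_counter()
for f in files[:50]:  # 50 files are enough for an estimate
    with open(f, "rb") as fh:
        total_bytes += len(fh.read())
elapsed = time.perf_counter() - start
print(f"read {total_bytes / 1e6:.0f} MB in {elapsed:.1f} s "
      f"({total_bytes / 1e6 / elapsed:.0f} MB/s)")
```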

Finally, I was wondering whether the RAM is simply not enough. I have 62 GB of RAM; while training on one GPU, about 40 GB is constantly in use, and while training on both GPUs it is completely full the whole time.

In order to make the training faster, I tried:

  • setting OMP_NUM_THREADS=1, which gave about a 20-second improvement per epoch
  • setting "export nnUNet_n_proc_DA=17"
  • checking that I am not doing data augmentation: I should not be, since I am using a custom trainer that inherits from nnUNetTrainerV2 and does not do DA. Is that correct, or are there other places to check for DA?

These changes only gave a slight improvement in epoch time. Is there anything else I can do to reduce training time? For example, loading the data from disk into RAM before training so that loading is faster?
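What I have in mind is roughly the sketch below: copy the preprocessed task to a tmpfs mount such as /dev/shm and point nnUNet_preprocessed there before launching the training. Paths and the task name are placeholders, and it only works if the data fits in the remaining RAM:

```python
# Sketch of the "load into RAM first" idea: stage the preprocessed task on a
# tmpfs mount so the data loaders read from RAM instead of the network/virtual
# disk. Not part of nnU-Net; Task501_MyTask is a placeholder.
import os
import shutil

src = os.path.join(os.environ["nnUNet_preprocessed"], "Task501_MyTask")
dst = "/dev/shm/nnUNet_preprocessed/Task501_MyTask"

os.makedirs(os.path.dirname(dst), exist_ok=True)
shutil.copytree(src, dst, dirs_exist_ok=True)

# Then launch training with the env var pointing at the RAM copy, e.g.:
print("export nnUNet_preprocessed=/dev/shm/nnUNet_preprocessed")
```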

Thank you very much for your attention.

ghezzis commented on Dec 09 '23

Hi, did you find a good way to address this problem?

QianLingjun commented on Jan 10 '24