IMS-Toucan
Running weight averaging during training seems to hang it
Hi,
I am running the training in one remote terminal and I am doing inference of the current model on another one. Sometimes I test the current model and to do so I run the weight averaging script.
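For context, the weight averaging I run boils down to something like the sketch below (this is not the actual script from the repository; the checkpoint paths and the "model" key in the checkpoint dict are assumptions on my part). Everything is loaded on the CPU, so in principle it should not touch the GPU that the training is using:

```python
# Rough sketch of checkpoint weight averaging, loading on CPU only.
# Checkpoint file names and the "model" key are hypothetical.
import torch

def average_checkpoints(checkpoint_paths, output_path):
    averaged = None
    for path in checkpoint_paths:
        # map_location="cpu" keeps the averaging off the training GPU
        state = torch.load(path, map_location="cpu")["model"]
        if averaged is None:
            averaged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in averaged:
                averaged[k] += state[k].float()
    for k in averaged:
        averaged[k] /= len(checkpoint_paths)
    torch.save({"model": averaged}, output_path)

average_checkpoints(["checkpoint_1.pt", "checkpoint_2.pt", "checkpoint_3.pt"],
                    "averaged.pt")
```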
I have then noticed that the training later hangs in the following state:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 30%   21C    P8    19W / 350W |   8814MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
```
The VRAM is still allocated but the GPU shows 0% utilization, and the process cannot be "killed" via `fuser -v /dev/nvidia*`.
htop reports 11.1G/15.5G memory usage, 1.65G/2.00G swap usage, and around 25% CPU usage.
When I kill the `run_training` Python process via htop, memory drops to 1.41G and swap to 378M.
I am not sure whether running weight averaging during training can hang it (the project page states that weight averaging should only be run once training is complete).
Any hint appreciated!
I don't know of any connection between the weight averaging and the training. I have run the weight averaging many times while other training processes were ongoing and never encountered this issue. Unfortunately I cannot reproduce it and I don't see an obvious solution.
Ok, thank you. I will try to find a reproducible snippet or procedure. Maybe with fewer workers it won't appear anymore (I haven't yet tried reproducing it with 4 workers instead of the default 12).
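For reference, by "fewer workers" I mean the `num_workers` argument of the PyTorch DataLoader. Below is a minimal sketch with a dummy dataset just to show the setting; the actual loader construction inside IMS-Toucan's training pipeline may look different:

```python
# Minimal illustration of reducing the DataLoader worker count.
# The dataset here is a stand-in, not the real IMS-Toucan dataset.
import torch
from torch.utils.data import DataLoader, TensorDataset

dummy_dataset = TensorDataset(torch.randn(128, 80),
                              torch.randint(0, 10, (128,)))

train_loader = DataLoader(dummy_dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=4,   # reduced from the default of 12
                          pin_memory=True)

for features, labels in train_loader:
    pass  # one pass just to show the loader spawning 4 worker processes
```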