IMS-Toucan

Running weight averaging during training seems to hang it

Open · Ca-ressemble-a-du-fake opened this issue 1 year ago · 2 comments

Hi,

I am running the training in one remote terminal and I am doing inference of the current model on another one. Sometimes I test the current model and to do so I run the weight averaging script.
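For context, the weight averaging step conceptually just loads the last few saved checkpoints and averages their parameters before inference. A minimal sketch of that idea follows; the checkpoint layout (a dict saved with a "model" key) and the file names are assumptions for illustration, not IMS-Toucan's actual code:

import torch

def average_checkpoints(checkpoint_paths):
    # Average the parameter tensors of several checkpoints.
    # Assumption: each file was saved as torch.save({"model": state_dict, ...}, path).
    averaged = None
    for path in checkpoint_paths:
        # map_location="cpu" keeps this off the GPU that training is using
        state = torch.load(path, map_location="cpu")["model"]
        if averaged is None:
            averaged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                averaged[k] += v.float()
    for k in averaged:
        averaged[k] /= len(checkpoint_paths)
    return averaged

# hypothetical usage:
# avg = average_checkpoints(["checkpoint_100000.pt", "checkpoint_110000.pt"])
# torch.save({"model": avg}, "averaged_model.pt")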

Then I have noticed that later the training hangs in the following state :

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 30%   21C    P8    19W / 350W |   8814MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The VRAM is still allocated but the GPU is idle, and the process cannot be "killed" with fuser -v /dev/nvidia*.

htop reports 11.1G/15.5G for Mem, 1.65G/2.00G for Swap, and about 25% CPU usage.

When I kill the "run_training" Python process via htop, Mem falls to 1.41G and Swap to 378M.

I am not sure whether running weight averaging during training can hang it (the project page states that it should only be run once training is complete).

Any hint appreciated!

Ca-ressemble-a-du-fake avatar Mar 07 '23 15:03 Ca-ressemble-a-du-fake

I don't know of any connection between the weight averaging and the training; I have run the weight averaging many times while other training processes were ongoing and never encountered this issue. Unfortunately I can't reproduce it, and I don't see an obvious solution.

Flux9665 avatar Mar 08 '23 15:03 Flux9665

Ok, thank you. I'll try to find a reproducible snippet or procedure. Maybe with fewer workers it won't appear anymore (I haven't yet tried running it again with 4 workers instead of the default 12); a sketch of what I mean is below.
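For illustration, the worker count is typically just the num_workers argument of the training DataLoader. This is only a hedged sketch with a placeholder dataset, not the actual IMS-Toucan training code:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in the real pipeline this would be the TTS training set.
dummy_dataset = TensorDataset(torch.randn(100, 80))

train_loader = DataLoader(dummy_dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=4,   # reduced from the default of 12
                          pin_memory=True)

Fewer workers means fewer forked loader processes competing for RAM and swap, which is the part I suspect in the hang.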

Ca-ressemble-a-du-fake avatar Mar 08 '23 20:03 Ca-ressemble-a-du-fake