IMS-Toucan

Running weight averaging during training seems to hang it

Open · Ca-ressemble-a-du-fake opened this issue 1 year ago · 2 comments

Hi,

I am running the training in one remote terminal and I am doing inference of the current model on another one. Sometimes I test the current model and to do so I run the weight averaging script.
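For context, the weight averaging step conceptually just loads the last few saved checkpoints and averages their parameters before inference. A minimal sketch of that idea follows; the checkpoint layout (a dict saved with a "model" key) and the file names are assumptions for illustration, not IMS-Toucan's actual code:

import torch

def average_checkpoints(checkpoint_paths):
    # Average the parameter tensors of several checkpoints.
    # Assumption: each file was saved as torch.save({"model": state_dict, ...}, path).
    averaged = None
    for path in checkpoint_paths:
        # map_location="cpu" keeps this off the GPU that training is using
        state = torch.load(path, map_location="cpu")["model"]
        if averaged is None:
            averaged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                averaged[k] += v.float()
    for k in averaged:
        averaged[k] /= len(checkpoint_paths)
    return averaged

# hypothetical usage:
# avg = average_checkpoints(["checkpoint_100000.pt", "checkpoint_110000.pt"])
# torch.save({"model": avg}, "averaged_model.pt")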

Then I have noticed that later the training hangs in the following state :

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 30%   21C    P8    19W / 350W |   8814MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The VRAM is still allocated but the GPU is idle, and the process cannot be "killed" with fuser -v /dev/nvidia*.

htop reports 11.1G/15.5G for Mem, 1.65G/2.00G for Swap, and about 25% CPU usage.

When I kill the "run_training" Python process via htop, Mem falls to 1.41G and Swap to 378M.

I am not sure whether running weight averaging during training can hang it (the project page states that it should only be run once training is complete).

Any hint appreciated!

Ca-ressemble-a-du-fake avatar Mar 07 '23 15:03 Ca-ressemble-a-du-fake

I don't know of any connection between the weight averaging and the training; I have run the weight averaging many times while other training processes were ongoing and never encountered this issue. Unfortunately I can't reproduce it, and I don't see an obvious solution.

Flux9665 avatar Mar 08 '23 15:03 Flux9665

Ok, thank you. I'll try to find a reproducible snippet or procedure. Maybe with fewer workers it won't appear anymore (I haven't yet tried running it again with 4 workers instead of the default 12); a sketch of what I mean is below.
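For illustration, the worker count is typically just the num_workers argument of the training DataLoader. This is only a hedged sketch with a placeholder dataset, not the actual IMS-Toucan training code:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in the real pipeline this would be the TTS training set.
dummy_dataset = TensorDataset(torch.randn(100, 80))

train_loader = DataLoader(dummy_dataset,
                          batch_size=32,
                          shuffle=True,
                          num_workers=4,   # reduced from the default of 12
                          pin_memory=True)

Fewer workers means fewer forked loader processes competing for RAM and swap, which is the part I suspect in the hang.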

Ca-ressemble-a-du-fake avatar Mar 08 '23 20:03 Ca-ressemble-a-du-fake