
TensorBoard logs seem to fail during multi-GPU training

Open snowflakewang opened this issue 8 months ago • 2 comments

When I train the model with a single GPU on my local machine, TensorBoard works fine and shows the loss curve and other metrics. However, when I train the model with multiple GPUs, TensorBoard does not seem to record anything (the TensorBoard event files are very small, on the order of KB). I wonder whether this is a bug in multi-GPU training. Thank you.

snowflakewang, Apr 25 '25 06:04
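
For anyone trying to confirm the same symptom, a minimal sketch follows that lists the sizes of TensorBoard event files under a log directory and flags the small ones. It uses only the Python standard library and is not part of CogVideo or cogkit. The function names, the log directory path, and the 10 KB threshold are assumptions chosen for illustration. The only fact it relies on is that TensorBoard event file names contain `tfevents`.

```python
from pathlib import Path


def event_file_sizes(log_dir):
    """Map each TensorBoard event file under log_dir to its size in bytes.

    Keys are paths relative to log_dir, using forward slashes.
    """
    root = Path(log_dir)
    sizes = {}
    # TensorBoard event files are named like "events.out.tfevents.<...>".
    for path in root.rglob("*tfevents*"):
        if path.is_file():
            sizes[path.relative_to(root).as_posix()] = path.stat().st_size
    return sizes


def suspiciously_small(sizes, threshold_bytes=10 * 1024):
    """Return the sorted names of event files smaller than threshold_bytes.

    The 10 KB default is an arbitrary assumption, not a TensorBoard rule.
    """
    return sorted(name for name, size in sizes.items() if size < threshold_bytes)


# Example usage (the directory name is hypothetical):
#   sizes = event_file_sizes("output/logs")
#   print(suspiciously_small(sizes))
```

Comparing the output of a single-GPU run with that of a multi-GPU run shows whether the multi-GPU event files really stay near their initial size, which would suggest that scalars are never written rather than that TensorBoard fails to display them.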

That's strange; we have not encountered this issue during multi-GPU training. We will try to reproduce it later. In the meantime, we recommend switching to cogkit for training, since we have moved our maintenance of CogVideo training to cogkit, which offers better training efficiency and usability.

OleehyO, May 12 '25 02:05

Thank you for your valuable guidance! I will try cogkit!

snowflakewang, May 12 '25 06:05