
How long does training take with default training settings?

Open mesllo opened this issue 2 years ago • 2 comments

If I train with 8 GPUs, a batch size of 4, and 100 epochs (50 at the normal learning rate, 50 with a decaying learning rate), how long would it take? I'm asking because I only have 4 GPUs, and I am using a shared environment where I can't train for more than 8 hours per run.

mesllo · May 05 '22 17:05

The complete training has four stages. The third stage has 200 epochs and takes the longest. The training code prints the estimated time, so you can try it and see.

hanchaoyuan · May 06 '22 02:05
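
For readers wondering how a printed time estimate like the one mentioned above is usually derived, here is a minimal sketch. It is not taken from the PF-AFN training code; the function name `estimate_remaining_seconds` and its parameters are hypothetical. It extrapolates from the average time per step so far, which is only reliable once the first few (typically slower) iterations have passed.

```python
def estimate_remaining_seconds(elapsed_seconds, steps_done, total_steps):
    """Extrapolate remaining time from the average time per completed step.

    Hypothetical helper for illustration; not part of the PF-AFN repository.
    """
    if steps_done <= 0:
        raise ValueError("need at least one completed step to extrapolate")
    seconds_per_step = elapsed_seconds / steps_done
    return seconds_per_step * (total_steps - steps_done)


if __name__ == "__main__":
    # Example: 10 of 100 steps finished after 60 seconds.
    remaining = estimate_remaining_seconds(60.0, 10, 100)
    print(f"about {remaining / 60:.1f} minutes left")
```

Running a stage for a few minutes and reading the estimate is usually enough to judge whether it fits a time limit.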

Thank you for your reply! I have tried it, and even the first stage takes about 24 hours on 4 GPUs, which is not feasible for me. I could also take a multi-node approach and run on 8 GPUs across 2 nodes. Is the code already set up for multiple nodes, or do I have to do this myself? As far as I can tell, the current code only considers a single node.

mesllo · May 06 '22 12:05
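
The numbers in this thread (about 24 hours for the first stage on 4 GPUs, and an 8-hour limit per run) can be turned into a rough plan. The sketch below rests on two assumptions that the thread does not confirm: that throughput scales linearly with GPU count, which is a best case since multi-node communication normally costs something, and that training can be resumed from a checkpoint between runs. The function name `plan_runs` and its parameters are hypothetical.

```python
import math


def plan_runs(measured_hours, measured_gpus, target_gpus, max_hours_per_run):
    """Return (estimated_hours, number_of_runs) for one training stage.

    Assumes ideal linear scaling with GPU count (a best-case bound) and
    that training can resume from a checkpoint between runs.
    """
    if min(measured_hours, measured_gpus, target_gpus, max_hours_per_run) <= 0:
        raise ValueError("all arguments must be positive")
    estimated_hours = measured_hours * measured_gpus / target_gpus
    number_of_runs = math.ceil(estimated_hours / max_hours_per_run)
    return estimated_hours, number_of_runs


if __name__ == "__main__":
    # First stage as reported: about 24 hours on 4 GPUs, 8-hour limit per run.
    for gpus in (4, 8):
        hours, runs = plan_runs(24, 4, gpus, 8)
        print(f"{gpus} GPUs: ~{hours:.0f} h, at least {runs} run(s)")
```

Under these assumptions the first stage needs at least three 8-hour runs on 4 GPUs, or at least two on 8 GPUs, so resuming from checkpoints matters more here than doubling the GPU count.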