Parts2Whole

Training gets stuck when resuming from a previous checkpoint

Open LIAGM opened this issue 6 months ago • 0 comments

Hi,

I run into a problem when I resume from a checkpoint to continue training.

The training program always gets stuck after it resumes from the checkpoint.

For example, here is what happens when I resume from my checkpoint-6300 with 8 GPUs:

At first, after resuming from the checkpoint, the training process skips some iterations to catch up to the resume_step.

image
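
For reference, the skipping behavior described above can be sketched roughly as follows. This is only a minimal illustration of the idea, not the repository's actual training loop; `resume_batches` and its parameters are hypothetical names, and a plain `range` stands in for the real dataloader.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def resume_batches(batches: Iterable[T], resume_step: int) -> Iterator[Tuple[int, T]]:
    """Yield (step, batch) pairs, skipping the first `resume_step` batches.

    Hypothetical helper: it mimics a loop that fast-forwards through
    already-trained iterations before real training continues.
    """
    for step, batch in enumerate(batches):
        if step < resume_step:
            # Already covered by the checkpoint, so skip without training.
            continue
        yield step, batch


# Stand-in for a dataloader with 10 batches, resuming at step 6.
steps_run = [step for step, _ in resume_batches(range(10), resume_step=6)]
print(steps_run)  # [6, 7, 8, 9]
```

In this sketch the skipped iterations do no work at all, so the first real training step is the one at index `resume_step`.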

However, after reaching the resume step, the training process gets stuck, as shown in the following screenshot:

image

Here is another example, where I just let the training program keep running and got these errors:

image

Have you encountered this problem, or do you have any idea what might cause it?

Thanks for your time and help!

LIAGM avatar Jul 31 '24 00:07 LIAGM