Parts2Whole
Problem when resuming from the previous checkpoints
Hi,
I run into a problem when I try to resume training from a checkpoint.
Training always hangs after it resumes from the checkpoint.
For example, here is the screenshot when I try to resume from my checkpoint-6300 with 8 GPUs:
In the beginning, after resuming from the checkpoint, the training process skipped some iterations to match the resume_step.
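To be clear about the skipping behaviour I mean, here is a minimal sketch. It is only my understanding of the general pattern, not the repo's actual training loop, and I have not checked it against the real code; the `train` function and the `trained` list are hypothetical stand-ins.

```python
def train(batches, resume_step=0):
    """Toy loop: skip the steps already covered by the checkpoint, then train."""
    trained = []
    for step, batch in enumerate(batches):
        if step < resume_step:
            # The batch is still drawn from the loader, but no training step runs.
            continue
        trained.append((step, batch))  # stand-in for a real forward/backward pass
    return trained
```

In a multi-GPU run, every process would have to skip the same number of batches for the ranks to stay in step with each other.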
However, after reaching the resume step, the training process gets stuck, as shown in the following screenshot:
Here is another example, where I just let the training program keep running and got these errors:
Have you encountered such a problem, or do you have any idea about this?
Thanks for your time and help!