UniAD
Error during stage2_e2e training
I used 6 A100 GPUs to train the stage2_e2e model.
After the fourth epoch completed, an error occurred. The error is shown below.
The error is easy to reproduce.
Please help me solve this problem.
Have you solved the problem?
Same error. Have you solved the problem?
Same issue occurred. Did you solve it?
Hi @DeepBehavier @7bbjungle @duanmushuangquan @generalchan825. This happens occasionally during training, but unfortunately we have not found the cause. A recommended workaround is to resume from the checkpoint saved at the end of the 4th epoch and continue training the remaining epochs.
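For anyone applying the workaround, here is a minimal sketch of a helper that picks the most recent epoch checkpoint from a work directory. It assumes checkpoints are saved as `epoch_N.pth` (the default naming of the mmcv checkpoint hook). The function name `latest_epoch_checkpoint` and the directory layout are hypothetical, not part of UniAD.

```python
import re
from pathlib import Path
from typing import Optional


def latest_epoch_checkpoint(work_dir: str) -> Optional[Path]:
    """Return the epoch_N.pth file with the largest N, or None if there is none.

    Assumption: checkpoints follow the `epoch_N.pth` naming convention.
    """
    pattern = re.compile(r"^epoch_(\d+)\.pth$")
    best_path = None
    best_epoch = -1
    for path in Path(work_dir).glob("epoch_*.pth"):
        match = pattern.match(path.name)
        if match is None:
            continue  # skip files such as epoch_final.pth
        epoch = int(match.group(1))
        # Compare numerically so that epoch_10 ranks above epoch_4.
        if epoch > best_epoch:
            best_epoch = epoch
            best_path = path
    return best_path
```

If the training config is an mmcv-style Python config, the returned path can likely be assigned to its `resume_from` field before relaunching, so training continues from epoch 5 rather than restarting. Please check this against the config and launch script you actually use.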