UniAD

Error during train stage2_e2e

Open DeepBehavier opened this issue 10 months ago • 4 comments

I used 6 A100 GPUs to train the stage2_e2e model. After the fourth epoch completed, an error occurred, as shown below. [images: 1, 2] This error is easy to reproduce. Please help me solve this problem.

DeepBehavier avatar Apr 03 '24 01:04 DeepBehavier

Have you solved the problem?

7bbjungle avatar Apr 18 '24 07:04 7bbjungle

Same error. Have you solved the problem?

duanmushuangquan avatar Jun 24 '24 05:06 duanmushuangquan

Same issue occurred. Did you solve it?

generalchan825 avatar Sep 23 '24 05:09 generalchan825

Hi @DeepBehavier @7bbjungle @duanmushuangquan @generalchan825. It happens occasionally during training, but unfortunately we did not find the cause of this issue. A recommended workaround is to resume from the checkpoint of the 4th epoch and continue training for the remaining epochs.

YTEP-ZHI avatar Sep 23 '24 07:09 YTEP-ZHI
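
For anyone applying the workaround above, here is a minimal sketch of one way to locate the most recent epoch checkpoint to resume from. It assumes checkpoints are saved in the work directory as `epoch_N.pth` (the usual default naming in MMCV-based training, which is an assumption about your setup); the helper name `latest_epoch_checkpoint` and the example directory are hypothetical and not part of UniAD.

```python
import re
from pathlib import Path
from typing import Optional


def latest_epoch_checkpoint(work_dir: str) -> Optional[Path]:
    """Return the epoch_N.pth file with the largest N, or None if there is none.

    Assumption: checkpoints follow the `epoch_N.pth` naming pattern.
    """
    pattern = re.compile(r"^epoch_(\d+)\.pth$")
    best_path = None
    best_epoch = -1
    for path in Path(work_dir).glob("epoch_*.pth"):
        match = pattern.match(path.name)
        if match is None:
            continue  # skip files such as epoch_4.pth.bak
        epoch = int(match.group(1))
        # Compare numerically so that epoch_10 ranks above epoch_4.
        if epoch > best_epoch:
            best_epoch = epoch
            best_path = path
    return best_path


if __name__ == "__main__":
    # Hypothetical work directory; replace with your own.
    ckpt = latest_epoch_checkpoint("work_dirs/stage2_e2e")
    print(ckpt if ckpt is not None else "no epoch checkpoint found")
```

The returned path can then be passed to whatever resume option your training script exposes. In MMDetection-style scripts this is often a `--resume-from` argument or a `resume_from` config field, but check the script you are actually running.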