Training cannot converge and the value of grad_norm is nan.

Open sjtuljw520 opened this issue 1 year ago • 1 comments

Hi, thank you for sharing the code. I am trying to train my model with the config files "configs/tracking/petr/f1_q500_800x320.py" and "configs/tracking/petr/f3_q500_800x320.py", but neither the first stage (with f1_q500_800x320.py) nor the second stage (with f3_q500_800x320.py) converges. Specifically, the grad_norm becomes nan during training.

You can see the training logs at the links below. Could you help me figure out what is happening here? Maybe there are some mistakes in the config files? https://github.com/sjtuljw520/papers_and_others/blob/main/traning_log_first_stage.log https://github.com/sjtuljw520/papers_and_others/blob/main/traning_log_second_stage.log

Jul 12 '24 06:07 sjtuljw520

@sjtuljw520 Interesting, I haven't encountered this problem before. As a sanity check, can you run inference with my checkpoints and get correct results?

Jul 25 '24 05:07 ziqipang
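
For anyone debugging a similar log, one way to narrow things down is to find the first iteration where grad_norm stops being finite, since the settings and losses just before that point are usually the most informative. Below is a minimal sketch, not part of this repository. It assumes the log uses mmcv/mmdet-style text lines containing `Epoch [E][I/N]` and `grad_norm: <value>`; the function name `first_nonfinite_grad_norm` and both regex patterns are hypothetical and may need adjusting to the actual log format.

```python
import math
import re
import sys

# Assumed line shape (mmcv/mmdet-style text logger), e.g.
# "... - INFO - Epoch [1][50/7033]  lr: 1.0e-04, loss: 12.3, grad_norm: 45.6"
# Adjust these patterns if your log looks different.
ITER_RE = re.compile(r"Epoch \[(\d+)\]\[(\d+)/(\d+)\]")
GRAD_RE = re.compile(r"grad_norm: ([^\s,]+)")


def first_nonfinite_grad_norm(lines):
    """Return (epoch, iteration, raw_value) for the first nan/inf grad_norm, or None."""
    for line in lines:
        grad = GRAD_RE.search(line)
        if grad is None:
            continue  # line carries no grad_norm entry
        raw = grad.group(1)
        try:
            value = float(raw)  # float() accepts "nan" and "inf"
        except ValueError:
            continue  # unparsable value, skip the line
        if math.isfinite(value):
            continue
        it = ITER_RE.search(line)
        if it is None:
            return None, None, raw
        return int(it.group(1)), int(it.group(2)), raw
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_nan.py path/to/training.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        print(first_nonfinite_grad_norm(f))
```

If the first non-finite value shows up within the first few iterations, the learning rate, warmup, or data loading is a more likely suspect than slow divergence later in training.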