Training does not converge and grad_norm becomes NaN
Hi, thank you for sharing the code. I am trying to train my model with the config files "configs/tracking/petr/f1_q500_800x320.py" and "configs/tracking/petr/f3_q500_800x320.py", but neither the first stage (with f1_q500_800x320.py) nor the second stage (with f3_q500_800x320.py) converges. Specifically, grad_norm becomes NaN during training.
You can see the training logs at the links below. Could you help me figure out what is going wrong? Maybe there are some mistakes in the config files? https://github.com/sjtuljw520/papers_and_others/blob/main/traning_log_first_stage.log https://github.com/sjtuljw520/papers_and_others/blob/main/traning_log_second_stage.log
@sjtuljw520 Interesting, I haven't encountered this problem before. As a sanity check, can you run inference with my checkpoints and get correct results?