AdaptFormer

How to solve the problem of NaN loss?

Open xiaoxiAries opened this issue 3 years ago • 3 comments

Hi,

I reproduced this code on the SSv2 dataset. I follow blr 0.1 and use 2 GPUs with a batch size of 7 per GPU (effective total batch size 14), but the loss becomes NaN at epoch 14. How can I solve this problem? Thanks~

xiaoxiAries avatar Oct 06 '22 08:10 xiaoxiAries

Hi,

Which configuration do you use? Full-tuning baseline or AdaptFormer?

ShoufaChen avatar Oct 07 '22 00:10 ShoufaChen

Hi, I follow this configuration:

OMP_NUM_THREADS=1 python3 -m torch.distributed.launch \
    --nproc_per_node=2 --use_env main_video.py \
    --finetune /path/to/pre_trained/mae.pyth \
    --output_dir /path/to/output \
    --batch_size 7 --epochs 90 --blr 0.1 --weight_decay 0.0 --dist_eval \
    --data_path /path/to/SSV2 --data_set SSV2 \
    --ffn_adapt

xiaoxiAries avatar Oct 08 '22 07:10 xiaoxiAries

I am sorry, I didn't experiment with your specific configuration. Try reducing the learning rate and see whether the loss stays finite.

ShoufaChen avatar Oct 09 '22 03:10 ShoufaChen
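A note on what "reduce the learning rate" amounts to here: assuming this codebase follows the MAE-style linear scaling rule lr = blr * eff_batch_size / 256 (an assumption, not confirmed in this thread), blr 0.1 with an effective batch size of 14 gives an absolute lr of about 0.0055, so halving --blr halves that value. The sketch below is not the repository's actual training loop; model, criterion, and optimizer are placeholders. It shows the scaling arithmetic plus two common NaN mitigations: skipping non-finite losses and clipping gradients.

# Minimal sketch, not AdaptFormer's training code; names are placeholders.
import math
import torch

def effective_lr(blr, eff_batch_size):
    # Absolute learning rate under the MAE-style linear scaling rule (assumed).
    return blr * eff_batch_size / 256

def train_one_step(model, criterion, optimizer, samples, targets, max_norm=1.0):
    # One optimization step with a NaN guard and gradient clipping.
    outputs = model(samples)
    loss = criterion(outputs, targets)

    # Skip the update instead of corrupting the weights when the loss blows up.
    if not math.isfinite(loss.item()):
        print("Non-finite loss {}, skipping this step".format(loss.item()))
        optimizer.zero_grad()
        return None

    optimizer.zero_grad()
    loss.backward()
    # Clip the gradient norm to limit the step size on outlier batches.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    print(effective_lr(0.1, 14))   # ~0.0055 with the settings from this issue
    print(effective_lr(0.05, 14))  # ~0.0027 after halving --blr

With these numbers in hand, lowering --blr (e.g. from 0.1 to 0.05) is the most direct version of the suggestion above; gradient clipping and the non-finite-loss skip are generic safeguards, not something the authors prescribe here.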