mmtracking

NaN loss when training on MOT20

Open sjtuytc opened this issue 3 years ago • 2 comments

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

When I train on MOT20 without any modification to the code, the loss is always NaN.

Reproduction

  1. What command or script did you run?
bash ./tools/dist_train.sh ./configs/det/faster-rcnn_r50_fpn_8e_mot20-half.py 8 \
--work-dir ./work_dirs/
  2. Did you make any modifications to the code or config? Did you understand what you have modified? No

  3. What dataset did you use and what task did you run? MOT20, training

Environment

  1. Please run python mmtrack/utils/collect_env.py to collect necessary environment information and paste it here.

  2. You may add additional information that may be helpful for locating the problem, such as

    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

If applicable, paste the error traceback here.

[image]

A placeholder for traceback.

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

sjtuytc, Nov 08 '21 06:11

I ran the command: bash ./tools/dist_train.sh ./configs/det/faster-rcnn_r50_fpn_8e_mot20-half.py 8 --work-dir ./work_dirs/, and the detector trained successfully on MOT20. Please refer to the picture.

GT9505, Nov 08 '21 11:11

If you still get the error, try reducing the learning rate to 0.0001.

RoyAn2386, Jun 20 '23 02:06