
Loss computation sometimes causes NaN values

Open tobymuller233 opened this issue 1 year ago • 4 comments

Search before asking

  • [X] I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

These days, while fine-tuning my model after pruning by training it for several more epochs, I found that the loss value becomes NaN from time to time. By setting breakpoints and inspecting the values, I traced it to a bug in metrics.py: if the prediction for some bounding box has a width or height of 0, the loss turns into NaN, because in the CIoU computation h1 and h2 are used as divisors.
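
A minimal repro of what I believe is happening (this sketch assumes the CIoU aspect-ratio term uses torch.atan(w / h) as in utils/metrics.py; it is not the full bbox_iou code):

```python
import math
import torch

# Degenerate prediction: width and height both 0 -> w1 / h1 is 0/0 = nan,
# and the NaN propagates through atan() into the CIoU penalty and the loss.
w1, h1 = torch.tensor(0.0), torch.tensor(0.0)   # predicted box
w2, h2 = torch.tensor(4.0), torch.tensor(3.0)   # target box
v = (4 / math.pi ** 2) * (torch.atan(w2 / h2) - torch.atan(w1 / h1)).pow(2)
print(v)  # tensor(nan)
```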

Environment

No response

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • [x] Yes I'd like to help by submitting a PR!

tobymuller233 avatar Nov 15 '24 08:11 tobymuller233

👋 Hello @tobymuller233, thank you for your interest in YOLOv5 🚀! It seems like you're encountering a NaN value issue during training, and there might be a potential bug in the metrics.py file. To assist, we'll need a bit more information.

If this is a πŸ› Bug Report, please provide a minimum reproducible example to help us understand and debug the issue. This would include steps to replicate the bug, relevant sections of your code, and any specific error messages.

Additionally, it would be helpful to know more about your environment setup, such as the version of Python, PyTorch, and any other dependencies you are using.

If you have any further insights, like dataset characteristics or specific conditions that might trigger this issue, do share those as well.

Please note that this is an automated response, and an Ultralytics engineer will review your issue and provide further assistance soon. Thank you for your patience and help in improving YOLOv5! 🚀✨

UltralyticsAssistant avatar Nov 15 '24 08:11 UltralyticsAssistant

@tobymuller233 thank you for reporting this potential issue with loss computation. You've identified an important edge case where predictions with zero width or height could cause NaN values during CIoU loss calculation.

Before proceeding with a PR, please verify this behavior using the latest version of YOLOv5 as there have been several loss computation improvements. If you can provide a minimal reproducible example (MRE) following our MRE guide, it would help us investigate the issue more effectively.

For now, you could add a small epsilon value to prevent division by zero in the height calculations. However, we should also investigate why the model is predicting zero-sized bounding boxes during training, as this may indicate other underlying issues with the training process or data.
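
For illustration, here is a minimal sketch of that epsilon guard (the helper name is hypothetical; in YOLOv5 the equivalent expression lives inline in bbox_iou() in utils/metrics.py, which, as far as I recall, already takes an eps argument):

```python
import math
import torch

def ciou_aspect_term(w1, h1, w2, h2, eps=1e-7):
    # Hypothetical helper mirroring the CIoU aspect-ratio penalty, with eps added
    # to the heights so the ratios stay finite for zero-sized boxes.
    return (4 / math.pi ** 2) * (
        torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))
    ).pow(2)

# A zero-sized prediction no longer produces NaN:
print(ciou_aspect_term(torch.tensor(0.0), torch.tensor(0.0),
                       torch.tensor(4.0), torch.tensor(3.0)))  # finite value
```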

If you'd like to submit a PR, please ensure it includes:

  1. The MRE demonstrating the issue
  2. Your proposed fix
  3. Test cases verifying the solution

pderrenger avatar Nov 16 '24 01:11 pderrenger

@pderrenger I've also encountered the same edge case.

Here's how I ran into this issue: I'm training on a dataset called "RarePlanes" (the real-imagery portion of the dataset). Initially, I trained a model using the SiLU activation function. Now I'm fine-tuning the same model with LeakyReLU as the activation function, using the previous SiLU-trained model as my initial weights. When training the LeakyReLU model (yolov3-tiny, lr0=0.01, AdamW optimizer, with the same hyperparameters as scratch-low), I consistently encounter NaN gradients after just one batch.
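
To illustrate what I think is going on, here is a minimal sketch (assuming the aspect-ratio term uses torch.atan(w / h) on the raw widths and heights): even a finite forward value can produce NaN gradients when the height is zero.

```python
import torch

# Zero predicted height with nonzero width: the forward value is finite,
# but backpropagating through w / h multiplies 0 by inf, giving NaN gradients.
w = torch.tensor(2.0, requires_grad=True)
h = torch.tensor(0.0, requires_grad=True)
ratio = torch.atan(w / h)   # w / h = inf, atan(inf) = pi/2
ratio.backward()
print(ratio)   # tensor(1.5708, ...)
print(h.grad)  # tensor(nan)
```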

The workaround I've found is to modify bbox_iou by adding an epsilon to the width and height to prevent them from being zero. This resolves the NaN issue, but of course it affects the training results.

DianaMaz avatar Dec 22 '24 10:12 DianaMaz

Thank you for sharing your experience, @DianaMaz. It seems the issue arises from zero-width or zero-height predictions causing division by zero in the bbox_iou or CIoU calculations. Adding a small epsilon is indeed a practical workaround to prevent NaN values, but as you noted, it can affect the training results.

To address this more systematically:

  1. Ensure your dataset annotations and input preprocessing are correct, as malformed annotations can sometimes lead to such edge cases (a quick sanity check is sketched after this list).
  2. If using a modified model or optimizer, verify compatibility with the training pipeline and hyperparameters.
  3. If you haven't already, test with the latest YOLOv5 release to ensure you're benefiting from the latest updates and fixes.
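
As a quick way to rule out point 1, here is a hypothetical check that scans YOLO-format label files for zero-width or zero-height boxes (the label directory path is an assumption; adapt it to your dataset layout):

```python
from pathlib import Path

# Hypothetical sanity check for YOLO-format labels: "class x_center y_center width height",
# all normalised to [0, 1]. Adjust LABEL_DIR to match your dataset layout.
LABEL_DIR = Path("datasets/rareplanes/labels/train")  # assumed path

for label_file in sorted(LABEL_DIR.glob("*.txt")):
    for line_no, line in enumerate(label_file.read_text().splitlines(), start=1):
        parts = line.split()
        if len(parts) != 5:
            print(f"{label_file}:{line_no}: malformed line: {line!r}")
            continue
        _, _, _, w, h = map(float, parts)
        if w <= 0 or h <= 0:
            print(f"{label_file}:{line_no}: degenerate box (w={w}, h={h})")
```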

If you'd like to contribute your fix for broader discussion, consider submitting a pull request to the repository. Let us know if you need further assistance!

pderrenger avatar Dec 22 '24 17:12 pderrenger