Training with object detection, Loss: NaN
Hello,
I'm working on object detection with COCO data (two datasets, of 1855 and 58661 images). After some time I get Loss: NaN.
What could the problem be, given that I've tested different datasets?
Any suggestions, please? Thanks.
System: Windows
Training on CPU
Batch size: 5–10
Update on 2024/02/23 (from @LittleLittleCloud)
@LittleLittleCloud Got it reproduced on the following dataset using mlnet 16.18.2, with CPU, batch size 10, and epoch 1. On GPU the training is successful.
- https://automlbenchmark.blob.core.windows.net/dataset/od.v1i.coco.zip
First, install mlnet-win-x64 16.18.2:
dotnet tool install --global mlnet-win-x64 --version 16.18.2
Then, kick off the training:
mlnet object-detection --dataset /path/to/coco.json --device cpu --epoch 1
@michaelgsharp my best guess is that there's an overflow when calculating the focal loss?
https://github.com/dotnet/machinelearning/blob/902102e23d9bd825c44f203390801d7cc5d0275f/src/Microsoft.ML.TorchSharp/Loss/FocalLoss.cs#L37
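For illustration only (this is a made-up sketch of the general idea, not the FocalLoss.cs code linked above): if log(p) or log(1 - p) is taken from a saturated sigmoid, a focal loss term can blow up to Infinity and then NaN, whereas a softplus-based formulation keeps even extreme logits finite. The FocalLossSketch class, the Softplus helper, and the sample logits are all hypothetical names for this example.

```csharp
using System;

// A minimal, made-up sketch (not the linked FocalLoss.cs) showing how a
// sigmoid focal loss can be computed in a numerically stable way so that
// large-magnitude logits don't overflow to Infinity/NaN.
class FocalLossSketch
{
    // Stable softplus: log(1 + exp(x)) without letting exp overflow.
    static double Softplus(double x) =>
        Math.Max(x, 0) + Math.Log(1 + Math.Exp(-Math.Abs(x)));

    // Focal loss for a single logit and a binary target (0 or 1).
    static double FocalLoss(double logit, double target, double alpha = 0.25, double gamma = 2.0)
    {
        double logP = -Softplus(-logit);   // log(sigmoid(logit))      = log p
        double logQ = -Softplus(logit);    // log(1 - sigmoid(logit))  = log (1 - p)
        double p = Math.Exp(logP);

        double lossPos = -alpha * Math.Pow(1 - p, gamma) * logP;    // term used when target == 1
        double lossNeg = -(1 - alpha) * Math.Pow(p, gamma) * logQ;  // term used when target == 0
        return target * lossPos + (1 - target) * lossNeg;
    }

    static void Main()
    {
        // Even extreme logits stay finite with the stable formulation,
        // where a naive Math.Log(1 - p) would return -Infinity.
        Console.WriteLine(FocalLoss(logit: 1000.0, target: 0.0));
        Console.WriteLine(FocalLoss(logit: -1000.0, target: 1.0));
    }
}
```

Both sample calls print finite values; whether this is what actually happens inside the library's focal loss on CPU is only a guess.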
Is there a solution or suggestion, please?
@Cilouche which COCO dataset are you using? Could you share a link?
Data: https://drive.google.com/drive/folders/1-dQPRdQ-MRp6mrPhnpng5pcgJsTZMg23
I used this site https://drive.google.com/drive/folders/1-dQPRdQ-MRp6mrPhnpng5pcgJsTZMg23 to convert them to COCO format.
@Cilouche which site? The site link seems to be the same as the data link you shared.
Yes, sorry, the site is https://roboflow.com/convert/pascal-voc-xml-to-coco-json?ref=blog.roboflow.com
I've also noticed that the NaN loss only appears once the dataset gets large. Example: with 100 images and epoch = 8, all is well except that the precision is low (~0.69),
but from about 1200 images, with epoch = 5, 8, or 11, the loss quickly converges to NaN.
Update
I got it reproduced on my second training, thanks.
Original post
Hey @Cilouche, some updates here: I can't reproduce the NaN loss error using your dataset on the latest Model Builder main branch. Maybe it's already been fixed.
We haven't released Model Builder yet, but you can verify the latest bits in the mlnet CLI (> 16.18.2) by installing mlnet-win-x64
and trying object detection there. The mlnet CLI and Model Builder share the same AutoML service, so if you don't see the NaN error from the mlnet CLI, you probably won't see it from Model Builder either.
Steps to verify:
- Install the latest mlnet-win-x64 (https://www.nuget.org/packages/mlnet-win-x64); see the install command after these steps
- Start object detection there:
mlnet object-detection --dataset /path/to/coco.json --device cuda --epoch 1
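For completeness, installing the tool at its latest version with the standard dotnet tool command looks like this (if an older version is already installed, dotnet tool update --global mlnet-win-x64 bumps it instead):
dotnet tool install --global mlnet-win-x64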
Any suggestion or solution to work around this problem, please? Thanks.
Try a smaller batch size, maybe 1?
Also, GPU training doesn't produce a NaN loss; is training on GPU an option for you?
It works on GPU, thanks.