machinelearning Training with object detection, Loss: NaN

Hello,

I'm working on object detection with COCO-format data (two datasets, one with 1855 images and one with 58661 images). After some time, the loss becomes NaN.

What could the problem be, given that I have tested different datasets?

Any suggestions, please? Thanks.

System: Windows. Training on CPU. Batch size: 5 ~ 10

Update on 2024/02/23 (from @LittleLittleCloud)

@LittleLittleCloud reproduced it on the following dataset using mlnet 16.18.2 on CPU, with batch size set to 10 and epoch set to 1. On GPU the training is successful.

https://automlbenchmark.blob.core.windows.net/dataset/od.v1i.coco.zip

First install mlnet-win-x64 16.18.2

dotnet tool install --global mlnet-win-x64 --version 16.18.2

Then, kick off the training

mlnet object-detection --dataset /path/to/coco.json --device cpu --epoch 1

Feb 01 '24 12:02 Cilouche

@michaelgsharp my best guess is that there's an overflow when calculating the focal loss?

https://github.com/dotnet/machinelearning/blob/902102e23d9bd825c44f203390801d7cc5d0275f/src/Microsoft.ML.TorchSharp/Loss/FocalLoss.cs#L37

Feb 02 '24 19:02 LittleLittleCloud
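
A side note on the overflow guess above: the snippet below is a minimal Python sketch of how a focal loss can become non-finite when a predicted probability saturates to exactly 0, and how clamping the probability avoids it. It is not the code in FocalLoss.cs, and nothing in this thread confirms that this is the actual cause of the NaN. The function names, the `eps` value, and the `alpha`/`gamma` defaults are assumptions chosen for illustration.

```python
import math


def naive_focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction, with no numerical guard.

    p is the predicted probability of the positive class, target is 0 or 1.
    All names and defaults here are illustrative assumptions.
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # math.log(0.0) raises in Python, so -inf is substituted here to mimic
    # what tensor libraries return for log(0).
    log_p_t = math.log(p_t) if p_t > 0.0 else float("-inf")
    return -alpha_t * (1.0 - p_t) ** gamma * log_p_t


def clamped_focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Same loss, but with p clamped away from exactly 0 and 1 first."""
    p = min(max(p, eps), 1.0 - eps)
    return naive_focal_loss(p, target, alpha, gamma)


# A positive example whose predicted probability has saturated to 0.0.
bad = naive_focal_loss(0.0, 1)
print(bad)                         # inf
print(bad - bad)                   # nan: inf turns into NaN in later arithmetic
print(clamped_focal_loss(0.0, 1))  # a large but finite value
```

If something like this were the cause, one non-finite per-sample loss would be enough to make the reported batch loss non-finite, which would fit the loss staying at NaN once it appears.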

Is there a solution or suggestion, please?

Feb 05 '24 08:02 Cilouche

@Cilouche which COCO dataset are you using? Could you share a link?

Feb 05 '24 19:02 LittleLittleCloud

Data: https://drive.google.com/drive/folders/1-dQPRdQ-MRp6mrPhnpng5pcgJsTZMg23

I used this site https://drive.google.com/drive/folders/1-dQPRdQ-MRp6mrPhnpng5pcgJsTZMg23 to convert them to COCO format.

Feb 06 '24 15:02 Cilouche

@Cilouche which site? The site link seems to be the same as the data link you shared.

Feb 06 '24 19:02 LittleLittleCloud

Yes, sorry, the site is: https://roboflow.com/convert/pascal-voc-xml-to-coco-json?ref=blog.roboflow.com

I've also noticed that the loss becomes NaN once the dataset is large. For example, with data = 100 images and epoch = 8, all is well except that the precision is low (~0.69),

but from data ~= 1200 images, with epoch = 5, 8, or 11, the losses rapidly go to NaN.

Feb 07 '24 15:02 Cilouche

Update

I got it reproduced on my second training run, thanks.

Original post

Hey @Cilouche, some updates here: I can't reproduce the NaN loss error using your dataset in the latest Model Builder main branch. Maybe it's already been fixed.

We haven't released Model Builder yet, but you can verify the latest bits in mlnet cli > 16.18.2 by installing mlnet-win-x64 and trying object detection there. mlnet cli and Model Builder share the same AutoML service, so if you don't see the NaN error from mlnet cli, then you probably won't see it from Model Builder either.

Steps to verify:

Install the latest mlnet-win-x64 (https://www.nuget.org/packages/mlnet-win-x64)
Start object detection there:

mlnet object-detection --dataset /path/to/coco.json --device cuda --epoch 1

Feb 21 '24 19:02 LittleLittleCloud

Any suggestion or solution to bypass this problem, please? Thanks

Feb 22 '24 15:02 Cilouche

Try a smaller batch size, maybe 1?

Also, GPU training doesn't produce the NaN loss. Is training on GPU an option for you?

Feb 22 '24 20:02 LittleLittleCloud

It works on GPU, thanks.

Feb 23 '24 12:02 Cilouche
