
Training with object detection, Loss: NaN

Open Cilouche opened this issue 1 year ago • 10 comments

Hello,

I'm working on object detection with COCO-format data (two datasets, of 1855 and 58661 images). After some time, training reports Loss: NaN.

What could the problem be, given that I have tested different datasets?

Any suggestions, please? Thanks.

System: Windows. Training on CPU. Batch size: 5 to 10 images.

Update on 2024/02/23 (from @LittleLittleCloud)

I reproduced it on the following dataset using mlnet 16.18.2, on CPU, with batch size 10 and epoch 1. On GPU the training succeeds.

  • https://automlbenchmark.blob.core.windows.net/dataset/od.v1i.coco.zip

First, install mlnet-win-x64 16.18.2:

dotnet tool install --global mlnet-win-x64 --version 16.18.2

Then, kick off the training:

mlnet object-detection --dataset /path/to/coco.json --device cpu --epoch 1

Cilouche avatar Feb 01 '24 12:02 Cilouche

@michaelgsharp my best guess is that there's an overflow when calculating the focal loss?

https://github.com/dotnet/machinelearning/blob/902102e23d9bd825c44f203390801d7cc5d0275f/src/Microsoft.ML.TorchSharp/Loss/FocalLoss.cs#L37
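To illustrate how a focal loss can turn non-finite, here is a minimal sketch in plain Python. It is not the ML.NET code linked above, and whether this is the actual cause of the NaN in this issue is unconfirmed. The function names, the `alpha`/`gamma` defaults, the `eps` value, and the `ieee_log` helper are all assumptions made for the example. It shows one common failure mode: once the sigmoid saturates to exactly 1.0, the unused branch of the cross-entropy term evaluates `0 * log(0)`, which is NaN under IEEE-754 arithmetic. Clamping the probability away from 0 and 1 is one widely used guard.

```python
import math


def ieee_log(x: float) -> float:
    """Log with IEEE-754 behaviour (log(0) = -inf), as tensor libraries compute it.

    Python's math.log raises on 0, so this hypothetical helper mimics tensor math.
    """
    return math.log(x) if x > 0.0 else -math.inf


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss_naive(p: float, t: float, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Focal loss for one prediction p in [0, 1] and target t in {0.0, 1.0}, no guard."""
    # Binary cross-entropy written in the usual vectorised form.
    bce = -(t * ieee_log(p) + (1.0 - t) * ieee_log(1.0 - p))
    p_t = t * p + (1.0 - t) * (1.0 - p)          # probability of the true class
    alpha_t = t * alpha + (1.0 - t) * (1.0 - alpha)
    return alpha_t * (1.0 - p_t) ** gamma * bce


def focal_loss_clamped(p: float, t: float, alpha: float = 0.25,
                       gamma: float = 2.0, eps: float = 1e-4) -> float:
    """Same loss, but p is clamped to [eps, 1 - eps] so log never sees 0."""
    p = min(max(p, eps), 1.0 - eps)
    return focal_loss_naive(p, t, alpha, gamma)


if __name__ == "__main__":
    p = sigmoid(40.0)  # saturates to exactly 1.0 in floating point
    print("naive:  ", focal_loss_naive(p, 1.0))    # non-finite (NaN)
    print("clamped:", focal_loss_clamped(p, 1.0))  # small finite value
```

A single NaN like this is enough to poison the averaged batch loss and every weight update that follows, which would match the symptom of the loss staying at NaN once it appears.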

LittleLittleCloud avatar Feb 02 '24 19:02 LittleLittleCloud

Is there a solution or suggestion, please?

Cilouche avatar Feb 05 '24 08:02 Cilouche

@Cilouche which COCO dataset are you using? Could you share a link?

LittleLittleCloud avatar Feb 05 '24 19:02 LittleLittleCloud

Data: https://drive.google.com/drive/folders/1-dQPRdQ-MRp6mrPhnpng5pcgJsTZMg23

I used this site https://drive.google.com/drive/folders/1-dQPRdQ-MRp6mrPhnpng5pcgJsTZMg23 to convert the data to COCO format.

Cilouche avatar Feb 06 '24 15:02 Cilouche

@Cilouche which site? The site link seems to be the same as the data link you shared.

LittleLittleCloud avatar Feb 06 '24 19:02 LittleLittleCloud

Yes, sorry, the site is: https://roboflow.com/convert/pascal-voc-xml-to-coco-json?ref=blog.roboflow.com

I've also noticed that the NaN loss appears once the dataset is large. For example, with data = 100 images and epoch = 8, all is well except that the precision is low (~0.69),

but with data ~= 1200 images and epoch = 5, 8, or 11, the loss quickly becomes NaN.

Cilouche avatar Feb 07 '24 15:02 Cilouche

Update

I reproduced it on my second training run, thanks.

Original post

Hey @Cilouche, some updates here: I can't reproduce the NaN loss error using your dataset on the latest Model Builder main branch. Maybe it has already been fixed.

We haven't released Model Builder yet, but you can verify the latest bits in mlnet cli > 16.18.2 by installing mlnet-win-x64 and trying object detection there. mlnet cli and Model Builder share the same AutoML service, so if you don't see the NaN error from mlnet cli, you probably won't see it from Model Builder either.

Steps to verify:

  • Install the latest mlnet-win-x64 (https://www.nuget.org/packages/mlnet-win-x64)
  • Start object detection there:
mlnet object-detection --dataset /path/to/coco.json --device cuda --epoch 1

LittleLittleCloud avatar Feb 21 '24 19:02 LittleLittleCloud

Any suggestion or solution to work around this problem, please? Thanks.

Cilouche avatar Feb 22 '24 15:02 Cilouche

Try a smaller batch size, maybe 1?

Also, GPU training doesn't produce the NaN loss. Is training on GPU an option for you?

LittleLittleCloud avatar Feb 22 '24 20:02 LittleLittleCloud

It works on GPU, thanks.

Cilouche avatar Feb 23 '24 12:02 Cilouche