Get loss and evaluation values=0 when training yolonas with custom dataset

Open junrur opened this issue 1 year ago • 1 comments

💡 Your Question

I am trying to fine-tune the yolonas_s model on my custom dataset. I built a torch dataset that returns each sample as [img, target], where the image values are normalized and each target row uses the annotation format [cls, xc, yc, h, w]. I followed the instructions from the notebooks, and this is the output I get from the dataset (batch size = 8):

check batch
tensor([[0.0000, 1.0000, 0.2172, 0.3635, 0.7229, 0.1956],
        [0.0000, 1.0000, 0.5685, 0.4042, 0.8083, 0.4953],
        [0.0000, 2.0000, 0.2213, 0.5281, 0.3937, 0.1803],
        [0.0000, 5.0000, 0.5486, 0.6104, 0.3000, 0.3009],
        [1.0000, 1.0000, 0.4781, 0.3923, 0.7817, 0.3456],
        [1.0000, 1.0000, 0.5044, 0.8864, 0.2212, 0.2895],
        [1.0000, 3.0000, 0.4789, 0.5147, 0.1534, 0.1649],
        [1.0000, 3.0000, 0.4754, 0.4056, 0.0678, 0.2421],
        [2.0000, 1.0000, 0.4684, 0.3521, 0.6917, 0.3419],
        [2.0000, 2.0000, 0.4596, 0.3490, 0.6854, 0.3290],
        [3.0000, 1.0000, 0.4254, 0.5504, 0.3864, 0.0922],
        [3.0000, 1.0000, 0.9122, 0.5682, 0.3280, 0.1726],
        [3.0000, 3.0000, 0.4171, 0.6526, 0.1290, 0.0706],
        [3.0000, 3.0000, 0.8901, 0.6497, 0.1341, 0.1069],
        [3.0000, 2.0000, 0.8545, 0.6999, 0.0598, 0.0587],
        [4.0000, 1.0000, 0.5477, 0.1264, 0.2528, 0.1036],
        [4.0000, 1.0000, 0.5490, 0.5282, 0.5491, 0.1719],
        [4.0000, 1.0000, 0.5945, 0.3000, 0.3407, 0.0516],
        [4.0000, 3.0000, 0.5930, 0.2667, 0.0296, 0.0411],
        [4.0000, 3.0000, 0.5961, 0.3167, 0.0852, 0.0297],
        [4.0000, 2.0000, 0.5518, 0.7574, 0.1019, 0.1714],
        [5.0000, 1.0000, 0.5059, 0.4979, 0.9847, 0.1680],
        [5.0000, 2.0000, 0.5012, 0.8750, 0.2222, 0.1461],
        [5.0000, 3.0000, 0.5043, 0.6097, 0.3472, 0.1523],
        [6.0000, 1.0000, 0.7160, 0.4976, 0.9953, 0.3735],
        [6.0000, 1.0000, 0.5779, 0.8797, 0.2186, 0.1528],
        [6.0000, 1.0000, 0.3773, 0.1997, 0.3962, 0.2145],
        [6.0000, 1.0000, 0.2593, 0.9505, 0.0896, 0.1420],
        [6.0000, 3.0000, 0.7114, 0.6973, 0.2217, 0.2284],
        [6.0000, 3.0000, 0.7160, 0.5314, 0.1792, 0.3086],
        [7.0000, 1.0000, 0.4184, 0.5014, 0.9944, 0.2633],
        [7.0000, 2.0000, 0.4203, 0.7410, 0.3236, 0.2281],
        [7.0000, 2.0000, 0.4172, 0.3076, 0.6042, 0.2594]], dtype=torch.float64)
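
The rows in this dump have six columns although the dataset stores five values per box. The leading value runs from 0 to 7 for a batch of 8, so it looks like the index of the image within the batch. A minimal sketch of that flattening in plain Python, assuming this is what the collate step does (the name `flatten_targets` is hypothetical and this is not the library's implementation):

```python
def flatten_targets(per_image_targets):
    """Prefix every 5-value box row with the index of its image in the batch.

    per_image_targets: one list of box rows per image, each row holding
    five numbers (class id followed by four box values).
    Returns a flat list of 6-value rows: [image_index, *row].
    """
    flat = []
    for image_index, rows in enumerate(per_image_targets):
        for row in rows:
            flat.append([float(image_index), *row])
    return flat


# Illustrative batch of three images; the numbers are copied from the dump above.
batch = [
    [[1.0, 0.2172, 0.3635, 0.7229, 0.1956], [1.0, 0.5685, 0.4042, 0.8083, 0.4953]],
    [],  # an image without annotations contributes no rows at all
    [[1.0, 0.4684, 0.3521, 0.6917, 0.3419]],
]
flat = flatten_targets(batch)
```

Under this assumption, an image with no boxes simply never appears in the first column, which makes the dump a quick way to spot images without annotations.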

This is how I loaded the dataset

traindata_sampler = torch.utils.data.RandomSampler(train_dataset)  # sample from a shuffled dataset: dataset_train
train_batch_sampler = torch.utils.data.BatchSampler(traindata_sampler, batch_size=BATCH_SIZE, drop_last=True)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_sampler=train_batch_sampler, num_workers=4, collate_fn=DetectionCollateFN())
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE, num_workers=4, collate_fn=DetectionCollateFN())

And the train_params are (I changed some of them from the tutorial to simplify the training process)

train_params = {
    # ENABLING SILENT MODE
    "max_epochs": 1,
    'silent_mode': False,
    "average_best_models": True,
    # "warmup_mode": "linear_epoch_step",
    # "warmup_initial_lr": 1e-6,
    # "lr_warmup_epochs": 0,
    # "initial_lr": 5e-4,
    # "lr_mode": "cosine",
    # "cosine_final_lr_ratio": 0.1,
    "lr_mode": {"StepLR": {"gamma": 1, "step_size": 100, "phase": Phase.TRAIN_EPOCH_END}},
    "initial_lr": 0.0001,
    "optimizer": "Adam",
    "optimizer_params": {"weight_decay": 0.0001},
    "zero_weight_decay_on_bias_and_bn": True,
    # "ema": True,
    # "ema_params": {"decay": 0.9, "decay_type": "threshold"},
    "ema": False,
    "mixed_precision": True,
    "loss": PPYoloELoss(
        use_static_assigner=False,
        num_classes=CLASSES,
        reg_max=16
    ),
    "valid_metrics_list": [
        DetectionMetrics_050(
            score_thres=0.1,
            top_k_predictions=20,
            num_cls=CLASSES,
            normalize_targets=True,
            post_prediction_callback=PPYoloEPostPredictionCallback(
                score_threshold=0.01,
                nms_top_k=256,
                max_predictions=20,
                nms_threshold=0.7
            )
        )
    ],
    "metric_to_watch": 'mAP@0.50',
}

In general, the code runs without errors and the model has been trained for 100 epochs. The loss_cls value looks normal, but all the other values stay at 0.

SUMMARY OF EPOCH 0
├── Train
│   ├── Ppyoloeloss/loss_cls = 2.6048
│   ├── Ppyoloeloss/loss_iou = 0.0
│   ├── Ppyoloeloss/loss_dfl = 0.0
│   └── Ppyoloeloss/loss = 2.6048
└── Validation
    ├── Ppyoloeloss/loss_cls = 2.1583
    ├── Ppyoloeloss/loss_iou = 0.0
    ├── Ppyoloeloss/loss_dfl = 0.0
    ├── Ppyoloeloss/loss = 2.1583
    ├── Precision@0.50 = 0.0
    ├── Recall@0.50 = 0.0
    ├── mAP@0.50 = 0.0
    ├── F1@0.50 = 0.0
    └── Best_score_threshold = 0.0

I suspect there is a mistake in my data format. Has anyone faced this problem before? I would appreciate any help.

Versions

No response

Mar 08 '24 22:03 junrur

│   ├── Ppyoloeloss/loss_cls = 2.6048
│   ├── Ppyoloeloss/loss_iou = 0.0
│   ├── Ppyoloeloss/loss_dfl = 0.0
│   └── Ppyoloeloss/loss = 2.6048

indicates that you have no positive samples (e.g. the images have no objects). Since there are no positive targets for size regression, the regression terms (loss_iou and loss_dfl) are zero. I suggest double-checking the dataset implementation.
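
One way to double-check the dataset is to summarize a collated target batch before training. The sketch below is plain Python and only illustrative: the helper names `summarize_targets` and `to_pixels` are hypothetical, and the column order [image_index, cls, xc, yc, w, h] is an assumption (swap the last two columns if the dataset stores h before w). It also flags boxes whose coordinates all lie in [0, 1]. Whether the loss expects pixel coordinates rather than normalized ones is not confirmed in this thread, but it is a possibility worth ruling out, since boxes far smaller than one pixel could end up with no positive samples assigned.

```python
from collections import Counter


def summarize_targets(rows, batch_size):
    """Summarize a flat target batch.

    rows: iterable of [image_index, cls, xc, yc, w, h] rows,
          e.g. the result of calling .tolist() on the targets tensor.
    batch_size: number of images in the batch.
    """
    rows = [[float(v) for v in r] for r in rows]
    per_image = Counter(int(r[0]) for r in rows)
    return {
        # number of boxes for every image index, including zeros
        "boxes_per_image": {i: per_image[i] for i in range(batch_size)},
        # images that contribute no boxes at all
        "empty_images": [i for i in range(batch_size) if per_image[i] == 0],
        # True when every box value is in [0, 1], i.e. probably normalized
        "looks_normalized": bool(rows)
        and all(0.0 <= v <= 1.0 for r in rows for v in r[2:6]),
        # boxes with non-positive width or height
        "degenerate_boxes": [r for r in rows if r[4] <= 0 or r[5] <= 0],
    }


def to_pixels(rows, image_width, image_height):
    """Scale normalized [xc, yc, w, h] values to pixel units (assumed column order)."""
    return [
        [i, c, xc * image_width, yc * image_height, w * image_width, h * image_height]
        for i, c, xc, yc, w, h in rows
    ]


# Illustrative rows; the image size of 640x640 is an assumption for the example.
rows = [
    [0, 1, 0.2172, 0.3635, 0.7229, 0.1956],
    [2, 3, 0.5, 0.5, 0.1, 0.1],
]
report = summarize_targets(rows, batch_size=3)
pixel_rows = to_pixels(rows, image_width=640, image_height=640)
```

If `empty_images` lists most of the batch, the dataset is dropping annotations somewhere. If `looks_normalized` is true, it may be worth trying targets in pixel units to see whether loss_iou and loss_dfl become non-zero.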

May 07 '24 09:05 BloodAxe