No Bounding Boxes being plotted when doing inference on an image (Loading Trained Weights from Checkpoint)
Issue
I am doing custom training of the YOLO-NAS-M model on a custom dataset. I got a Recall of 0.71 and a Precision of 0.2 after 100+ epochs (Precision can probably be improved by changing some training parameters).
When I try to predict images using this model, there are no bounding boxes at all!
Oddly, the same model trained for only 50 epochs does plot some boxes!
(I have made a demo Google Colab notebook and am linking the weights and test data at the bottom.)
Hope someone can help with this issue. TIA!
Validation Code:
```python
from super_gradients.training import models
from super_gradients.training.metrics import DetectionMetrics_050
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback
from super_gradients.training.utils.checkpoint_utils import load_checkpoint_to_model

# `config`, `trainer`, and `test_data` come from the training notebook.
model = models.get(config.MODEL_NAME, num_classes=config.NUM_CLASSES)
load_checkpoint_to_model(net=model, ckpt_local_path="/content/drive/MyDrive/backup/300Run/ckpt_best.pth")

trainer.test(
    model=model,
    metrics_progress_verbose=True,
    test_loader=test_data,
    test_metrics_list=DetectionMetrics_050(
        score_thres=0.1,
        top_k_predictions=300,
        num_cls=config.NUM_CLASSES,
        normalize_targets=True,
        post_prediction_callback=PPYoloEPostPredictionCallback(
            score_threshold=0.01,
            nms_top_k=1000,
            max_predictions=300,
            nms_threshold=0.1,
        ),
    ),
)
```
Inference Code:
```python
from super_gradients.common.object_names import Models
from super_gradients.training import models
from super_gradients.training.processing import (
    ComposeProcessing, DetectionLongestMaxSizeRescale,
    DetectionCenterPadding, StandardizeImage, ImagePermute,
)

image_processor = ComposeProcessing(
    [
        DetectionLongestMaxSizeRescale(output_shape=(636, 636)),
        DetectionCenterPadding(output_shape=(640, 640), pad_value=114),
        StandardizeImage(max_value=255.0),
        ImagePermute(permutation=(2, 0, 1)),
    ]
)

model = models.get(
    Models.YOLO_NAS_M,
    checkpoint_path="/content/drive/MyDrive/backup/300Run/ckpt_best.pth",
    num_classes=config.NUM_CLASSES,
)
model.set_dataset_processing_params(
    class_names=["0"],
    # num_classes=config.NUM_CLASSES,
    image_processor=image_processor,
    iou=0.35,
    conf=0.25,
)

images_predictions = model.predict(IMAGES[0], iou=0.1, conf=0.5)
images_predictions.show(box_thickness=10, show_confidence=True)

images_predictions = model.predict(IMAGES[0], iou=0.1, conf=0.1)
images_predictions.show(box_thickness=10, show_confidence=True)
```
Relevant Validation Log (On 6 Images but stats are similar on the whole dataset):
```
Test: 100%|██████████| 1/1 [08:07<00:00, 487.49s/it, F1@0.50=0.408, Precision@0.50=0.286, Recall@0.50=0.71, mAP@0.50=0.319]
```
Google Colab Link for the code
Dataset Link
Weights Link
Versions
No response
Hi @MSSRPRAD, I am not sure why. It could (most likely) be due to no predictions on the input image, or alternatively to some issue with drawing on your image. Could you please iterate over the predictions and check:
- If there is any prediction
- If yes, whether the predictions are weird (coordinates outside the image, for instance). If you're not sure how, have a look at this page; there is also a minimal sketch below.
Let me know what you get, hoping this helps!
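Something along these lines (a minimal sketch; `model` and `IMAGES[0]` are assumed to be set up as in your inference code):

```python
# Run predict with a very low confidence threshold and print the raw
# detections; `model` and IMAGES are assumed to be set up as in the
# inference code above.
images_predictions = model.predict(IMAGES[0], conf=0.01)

for image_prediction in images_predictions:
    prediction = image_prediction.prediction
    print("num detections:", len(prediction.bboxes_xyxy))
    for bbox, confidence, label in zip(prediction.bboxes_xyxy, prediction.confidence, prediction.labels):
        # bbox is [x1, y1, x2, y2]; values far outside the image bounds
        # would point to a preprocessing/rescaling mismatch
        print(f"label={int(label)} conf={confidence:.3f} bbox={bbox}")
```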
@Louis-Dupont

> If there is any prediction

Maybe this is an issue with the training. I am sharing some training logs and the code.
CONFIG:
```
--------- config parameters ----------
{
"arch_params": {
"schema": null
},
"checkpoint_params": {
"load_checkpoint": false,
"schema": null
},
"training_hyperparams": {
"lr_warmup_epochs": 3,
"lr_warmup_steps": 0,
"lr_cooldown_epochs": 0,
"warmup_initial_lr": 1e-06,
"cosine_final_lr_ratio": 0.1,
"optimizer": "Adam",
"optimizer_params": {
"weight_decay": 0.0001
},
"criterion_params": {},
"ema": false,
"batch_accumulate": 1,
"ema_params": {
"decay": 0.9,
"decay_type": "threshold"
},
"zero_weight_decay_on_bias_and_bn": true,
"load_opt_params": true,
"run_validation_freq": 1,
"save_model": true,
"metric_to_watch": "[email protected]",
"launch_tensorboard": false,
"tb_files_user_prompt": false,
"silent_mode": false,
"mixed_precision": true,
"tensorboard_port": null,
"save_ckpt_epoch_list": [],
"average_best_models": true,
"dataset_statistics": false,
"save_tensorboard_to_s3": false,
"lr_schedule_function": null,
"train_metrics_list": [],
"valid_metrics_list": [
"DetectionMetrics_050(\n (post_prediction_callback): PPYoloEPostPredictionCallback()\n)"
],
"greater_metric_to_watch_is_better": true,
"precise_bn": false,
"precise_bn_batch_size": null,
"seed": 42,
"lr_mode": "cosine",
"phase_callbacks": null,
"log_installed_packages": true,
"sg_logger": "base_sg_logger",
"sg_logger_params": {
"tb_files_user_prompt": false,
"project_name": "",
"launch_tensorboard": false,
"tensorboard_port": null,
"save_checkpoints_remote": false,
"save_tensorboard_remote": false,
"save_logs_remote": false
},
"warmup_mode": "linear_epoch_step",
"step_lr_update_freq": null,
"lr_updates": [],
"clip_grad_norm": null,
"pre_prediction_callback": null,
"ckpt_best_name": "ckpt_best.pth",
"enable_qat": false,
"resume": false,
"resume_path": null,
"ckpt_name": "ckpt_latest.pth",
"resume_strict_load": false,
"sync_bn": false,
"kill_ddp_pgroup_on_end": true,
"max_train_batches": null,
"max_valid_batches": null,
"resume_from_remote_sg_logger": false,
"schema": {
"type": "object",
"properties": {
"max_epochs": {
"type": "number",
"minimum": 1
},
"lr_decay_factor": {
"type": "number",
"minimum": 0,
"maximum": 1
},
"lr_warmup_epochs": {
"type": "number",
"minimum": 0,
"maximum": 10
},
"initial_lr": {
"type": "number",
"exclusiveMinimum": 0,
"maximum": 10
}
},
"if": {
"properties": {
"lr_mode": {
"const": "step"
}
}
},
"then": {
"required": [
"lr_updates",
"lr_decay_factor"
]
},
"required": [
"max_epochs",
"lr_mode",
"initial_lr",
"loss"
]
},
"initial_lr": 0.0005,
"eyma": true,
"max_epochs": 300,
"loss": "PPYoloELoss(\n (static_assigner): ATSSAssigner()\n (assigner): TaskAlignedAssigner()\n)"
},
"dataset_params": {
"train_dataset_params": "{'data_dir': './SKU-110K/converted', 'images_dir': 'train/images', 'labels_dir': 'train/labelsn', 'classes': ['0'], 'input_dim': [640, 640], 'cache_dir': None, 'cache': False, 'transforms': [{'DetectionMosaic': {'input_dim': [640, 640], 'prob': 1.0}}, {'DetectionRandomAffine': {'degrees': 10.0, 'translate': 0.1, 'scales': [0.1, 2], 'shear': 2.0, 'target_size': [640, 640], 'filter_box_candidates': True, 'wh_thr': 2, 'area_thr': 0.1, 'ar_thr': 20}}, {'DetectionMixup': {'input_dim': [640, 640], 'mixup_scale': [0.5, 1.5], 'prob': 1.0, 'flip_prob': 0.5}}, {'DetectionHSV': {'prob': 1.0, 'hgain': 5, 'sgain': 30, 'vgain': 30}}, {'DetectionHorizontalFlip': {'prob': 0.5}}, {'DetectionPaddedRescale': {'input_dim': [640, 640], 'max_targets': 120}}, {'DetectionTargetsFormatTransform': {'input_dim': [640, 640], 'output_format': 'LABEL_CXCYWH'}}], 'class_inclusion_list': None, 'max_num_samples': None}",
"train_dataloader_params": {
"batch_size": 4,
"num_workers": 8,
"shuffle": true,
"drop_last": true,
"pin_memory": true,
"collate_fn": "<super_gradients.training.utils.detection_utils.DetectionCollateFN object at 0x7f2de929c9d0>"
},
"valid_dataset_params": "{'data_dir': './SKU-110K/converted', 'images_dir': 'val/images', 'labels_dir': 'val/labelsn', 'classes': ['0'], 'input_dim': [640, 640], 'cache_dir': None, 'cache': False, 'transforms': [{'DetectionPaddedRescale': {'input_dim': [640, 640]}}, {'DetectionTargetsFormatTransform': {'max_targets': 50, 'input_dim': [640, 640], 'output_format': 'LABEL_CXCYWH'}}], 'class_inclusion_list': None, 'max_num_samples': None}",
"valid_dataloader_params": {
"batch_size": 4,
"num_workers": 8,
"shuffle": true,
"drop_last": true,
"pin_memory": true,
"collate_fn": "<super_gradients.training.utils.detection_utils.DetectionCollateFN object at 0x7f2de929c9d0>"
},
"schema": null
},
```
CODE:
```python
model = models.get(config.MODEL_NAME, num_classes=config.NUM_CLASSES)
load_checkpoint_to_model(net=model, ckpt_local_path="/user/yolonas/SKU-110K/checkpoints/FirstRun/ckpt_latest.pth")
```
First I trained the model for 50 epochs, then loaded that checkpoint and set it to train for a further 300 epochs (it got interrupted at epoch 105, but that shouldn't make a difference). The pattern I used looks roughly like the sketch below.
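A sketch only; `train_params`, `train_data`, `val_data`, and the experiment name are placeholders for the actual values in my notebook:

```python
from super_gradients import Trainer
from super_gradients.training import models

# Placeholder experiment name / checkpoint dir for the second run.
trainer = Trainer(experiment_name="SecondRun", ckpt_root_dir="./SKU-110K/checkpoints")

# Initialize the model from the 50-epoch checkpoint of the first run.
model = models.get(
    config.MODEL_NAME,
    num_classes=config.NUM_CLASSES,
    checkpoint_path="/user/yolonas/SKU-110K/checkpoints/FirstRun/ckpt_latest.pth",
)

# train_params and the dataloaders are placeholders for the real setup.
trainer.train(
    model=model,
    training_params=train_params,
    train_loader=train_data,
    valid_loader=val_data,
)
```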
TRAIN LOGS (50 Epochs)
```
===========================================================
SUMMARY OF EPOCH 50
├── Training
│   ├── Ppyoloeloss/loss = 1.7945
│   │   ├── Best until now = 1.7942 (↗ 0.0003)
│   │   └── Epoch N-1 = 1.7942 (↗ 0.0003)
│   ├── Ppyoloeloss/loss_cls = 0.8438
│   │   ├── Best until now = 0.6648 (↗ 0.179)
│   │   └── Epoch N-1 = 0.8429 (↗ 0.0009)
│   ├── Ppyoloeloss/loss_dfl = 0.8001
│   │   ├── Best until now = 0.8002 (↘ -1e-04)
│   │   └── Epoch N-1 = 0.8002 (↘ -1e-04)
│   └── Ppyoloeloss/loss_iou = 0.2203
│       ├── Best until now = 0.2205 (↘ -0.0002)
│       └── Epoch N-1 = 0.2205 (↘ -0.0002)
└── Validation
    ├── F1@0.50 = 0.408
    │   ├── Best until now = 0.4062 (↗ 0.0018)
    │   └── Epoch N-1 = 0.3985 (↗ 0.0095)
    ├── Map@0.50 = 0.265
    │   ├── Best until now = 0.2612 (↗ 0.0038)
    │   └── Epoch N-1 = 0.2401 (↗ 0.0249)
    ├── Ppyoloeloss/loss = 1.8841
    │   ├── Best until now = 1.8717 (↗ 0.0124)
    │   └── Epoch N-1 = 1.8993 (↘ -0.0152)
    ├── Ppyoloeloss/loss_cls = 1.0272
    │   ├── Best until now = 0.9789 (↗ 0.0483)
    │   └── Epoch N-1 = 1.0314 (↘ -0.0042)
    ├── Ppyoloeloss/loss_dfl = 0.7045
    │   ├── Best until now = 0.7038 (↗ 0.0006)
    │   └── Epoch N-1 = 0.7123 (↘ -0.0078)
    ├── Ppyoloeloss/loss_iou = 0.2019
    │   ├── Best until now = 0.1994 (↗ 0.0025)
    │   └── Epoch N-1 = 0.2047 (↘ -0.0028)
    ├── Precision@0.50 = 0.2769
    │   ├── Best until now = 0.2761 (↗ 0.0008)
    │   └── Epoch N-1 = 0.272 (↗ 0.0049)
    └── Recall@0.50 = 0.7747
        ├── Best until now = 0.7737 (↗ 0.001)
        └── Epoch N-1 = 0.7447 (↗ 0.03)
===========================================================
[2023-06-18 15:34:35] INFO - base_sg_logger.py - [CLEANUP] - Successfully stopped system monitoring process
------------------------------Finished Training-------------------------------------------
```
TRAINING LOGS (100+ Epochs)
```
===========================================================
SUMMARY OF EPOCH 0
├── Training
│   ├── Ppyoloeloss/loss = 1.785
│   ├── Ppyoloeloss/loss_cls = 0.8411
│   ├── Ppyoloeloss/loss_dfl = 0.7957
│   └── Ppyoloeloss/loss_iou = 0.2184
└── Validation
    ├── F1@0.50 = 0.402
    ├── Map@0.50 = 0.2514
    ├── Ppyoloeloss/loss = 1.8759
    ├── Ppyoloeloss/loss_cls = 1.0222
    ├── Ppyoloeloss/loss_dfl = 0.7035
    ├── Ppyoloeloss/loss_iou = 0.2008
    ├── Precision@0.50 = 0.2726
    └── Recall@0.50 = 0.7649
===========================================================
===========================================================
SUMMARY OF EPOCH 104
├── Training
│   ├── Ppyoloeloss/loss = 1.836
│   │   ├── Best until now = 1.785 (↗ 0.051)
│   │   └── Epoch N-1 = 1.8394 (↘ -0.0034)
│   ├── Ppyoloeloss/loss_cls = 0.8558
│   │   ├── Best until now = 0.8411 (↗ 0.0147)
│   │   └── Epoch N-1 = 0.8555 (↗ 0.0003)
│   ├── Ppyoloeloss/loss_dfl = 0.8234
│   │   ├── Best until now = 0.7957 (↗ 0.0278)
│   │   └── Epoch N-1 = 0.8273 (↘ -0.0039)
│   └── Ppyoloeloss/loss_iou = 0.2274
│       ├── Best until now = 0.2184 (↗ 0.009)
│       └── Epoch N-1 = 0.2281 (↘ -0.0007)
└── Validation
    ├── F1@0.50 = 0.3691
    │   ├── Best until now = 0.402 (↘ -0.0329)
    │   └── Epoch N-1 = 0.3829 (↘ -0.0138)
    ├── Map@0.50 = 0.2105
    │   ├── Best until now = 0.2517 (↘ -0.0412)
    │   └── Epoch N-1 = 0.2303 (↘ -0.0198)
    ├── Ppyoloeloss/loss = 1.9345
    │   ├── Best until now = 1.8759 (↗ 0.0586)
    │   └── Epoch N-1 = 1.9501 (↘ -0.0156)
    ├── Ppyoloeloss/loss_cls = 1.0391
    │   ├── Best until now = 1.0148 (↗ 0.0243)
    │   └── Epoch N-1 = 1.0539 (↘ -0.0149)
    ├── Ppyoloeloss/loss_dfl = 0.7365
    │   ├── Best until now = 0.7035 (↗ 0.033)
    │   └── Epoch N-1 = 0.7315 (↗ 0.005)
    ├── Ppyoloeloss/loss_iou = 0.2109
    │   ├── Best until now = 0.2008 (↗ 0.0101)
    │   └── Epoch N-1 = 0.2122 (↘ -0.0013)
    ├── Precision@0.50 = 0.2476
    │   ├── Best until now = 0.2744 (↘ -0.0268)
    │   └── Epoch N-1 = 0.2632 (↘ -0.0156)
    └── Recall@0.50 = 0.7242
        ├── Best until now = 0.7649 (↘ -0.0407)
        └── Epoch N-1 = 0.7023 (↗ 0.0219)
===========================================================
```
I had the same problem and it was solved by changing my CUDA version from 11.8 to 10.2, but I do not know why. @BloodAxe @Louis-Dupont @MSSRPRAD
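For anyone who wants to rule this in or out, a quick way to check which CUDA build PyTorch is actually running (plain PyTorch calls, nothing super-gradients-specific):

```python
import torch

# Report the torch wheel version, the CUDA toolkit it was built against,
# and whether a GPU is actually visible in this environment.
print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```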
Hello @MSSRPRAD, were you able to resolve this issue? I am facing a similar bug and my case is just like yours, so please suggest. Thanks!
Attached is my model training file. I have ~8000 images and trained the model for 20 epochs as per the attached file. Due to memory issues the training was interrupted, but I resumed it from the last checkpoint. However, when I test an image, it simply shows the image without any bounding boxes. Please help.
> Hello @MSSRPRAD, were you able to resolve this issue? I am facing a similar bug and my case is just like yours, so please suggest. Thanks!
Hello @prernabhadwal. I was not able to figure it out at the time, and I have not worked on this for almost a year, so I might not be able to help.
I used to encounter the same problem, but after I removed `set_dataset_processing_params`, it was solved.
BTW, if you encounter the error message below after removing `set_dataset_processing_params`, you can review issue #1739 for more information.
```
RuntimeError: You must set the dataset processing parameters before calling predict.
Please call model.set_dataset_processing_params(...) first.
```
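In other words, the idea is to let the model fall back to the processing params already stored in the checkpoint instead of overriding them. A rough sketch, with the path and class count as placeholders rather than real values:

```python
from super_gradients.common.object_names import Models
from super_gradients.training import models

# Load the trained weights without calling set_dataset_processing_params();
# recent checkpoints carry the dataset processing params along.
model = models.get(
    Models.YOLO_NAS_M,
    num_classes=1,                    # placeholder: your class count
    checkpoint_path="ckpt_best.pth",  # placeholder: your checkpoint
)

# Predict directly. If this raises the RuntimeError quoted above, the
# checkpoint predates processing-params saving -- see issue #1739.
predictions = model.predict("test.jpg", conf=0.25)
predictions.show()
```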