No Bounding Boxes being plotted when doing inference on an image (Loading Trained Weights from Checkpoint)

Open MSSRPRAD opened this issue 2 years ago • 7 comments
Issue

I am training the YOLO-NAS medium model on a custom dataset. After 100+ epochs I get a recall of 0.71 and a precision of 0.2 (precision can probably be improved by changing some training parameters).

When I try to run prediction on images with this model, no bounding boxes are drawn at all!

Another funny thing: the same model, when trained for only up to 50 epochs, plots some boxes!

(I have made a demo google colab sheet and am linking the weights and test data at the bottom.)

Hope someone can help with this issue..... TIA!

Validation Code:

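from super_gradients.training import models
from super_gradients.training.metrics import DetectionMetrics_050
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback
from super_gradients.training.utils.checkpoint_utils import load_checkpoint_to_model

# `config`, `trainer` and `test_data` are defined earlier in the notebook.
model = models.get(config.MODEL_NAME, num_classes=config.NUM_CLASSES)
load_checkpoint_to_model(net=model, ckpt_local_path="/content/drive/MyDrive/backup/300Run/ckpt_best.pth")
trainer.test(
    model=model,
    metrics_progress_verbose=True,
    test_loader=test_data,
    test_metrics_list=DetectionMetrics_050(
        score_thres=0.1,
        top_k_predictions=300,
        num_cls=config.NUM_CLASSES,
        normalize_targets=True,
        post_prediction_callback=PPYoloEPostPredictionCallback(
            score_threshold=0.01,
            nms_top_k=1000,
            max_predictions=300,
            nms_threshold=0.1,
        ),
    ),
)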

Inference Code

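from super_gradients.common.object_names import Models
from super_gradients.training import models
from super_gradients.training.processing.processing import (
    ComposeProcessing,
    DetectionCenterPadding,
    DetectionLongestMaxSizeRescale,
    ImagePermute,
    StandardizeImage,
)

# `config` and `IMAGES` are defined earlier in the notebook.
image_processor = ComposeProcessing(
    [
        DetectionLongestMaxSizeRescale(output_shape=(636, 636)),
        DetectionCenterPadding(output_shape=(640, 640), pad_value=114),
        StandardizeImage(max_value=255.0),
        ImagePermute(permutation=(2, 0, 1)),
    ]
)
model = models.get(
    Models.YOLO_NAS_M,
    checkpoint_path="/content/drive/MyDrive/backup/300Run/ckpt_best.pth",
    num_classes=config.NUM_CLASSES,
)

model.set_dataset_processing_params(
    class_names=["0"],
    # num_classes=config.NUM_CLASSES,
    image_processor=image_processor,
    iou=0.35, conf=0.25,
)
# Predict twice on the same image: once with a strict and once with a lenient confidence threshold.
images_predictions = model.predict(IMAGES[0], iou=0.1, conf=0.5)
images_predictions.show(box_thickness=10, show_confidence=True)
images_predictions = model.predict(IMAGES[0], iou=0.1, conf=0.1)
images_predictions.show(box_thickness=10, show_confidence=True)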
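The two `predict` calls above differ only in `conf`, so it can help to see how many candidate detections would survive each threshold before anything is drawn. The sketch below is plain Python rather than super-gradients code: `count_surviving` is a hypothetical helper, the scores are made-up values for illustration, and the `>=` comparison is an assumption about how a threshold is applied.

```python
from typing import Dict, Iterable, Sequence


def count_surviving(scores: Sequence[float], thresholds: Iterable[float]) -> Dict[float, int]:
    """Return, for each threshold, how many scores are at or above it."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Hypothetical confidence scores, for illustration only.
scores = [0.04, 0.08, 0.12, 0.19, 0.31]
counts = count_surviving(scores, thresholds=(0.5, 0.25, 0.1, 0.01))
for threshold, kept in counts.items():
    print(f"conf >= {threshold}: {kept} box(es) kept")
```

If every count is zero even at a very low threshold, the threshold is probably not what hides the boxes, and the model output itself deserves a closer look.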

Relevant Validation Log (On 6 Images but stats are similar on the whole dataset):

Test: 100%|██████████| 1/1 [08:07<00:00, 487.49s/it, F1@0.50=0.408, Precision@0.50=0.286, Recall@0.50=0.71, mAP@0.50=0.319]
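As a quick consistency check on these numbers, F1 is the harmonic mean of precision and recall, so the values in the log line should agree with each other. The sketch below assumes that 0.286 is the precision and 0.71 is the recall (the recall figure matches the one quoted at the top of the issue); `f1_score` is an illustrative helper, not a super-gradients function.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values read off the test log above (assumed mapping of numbers to metrics).
precision, recall = 0.286, 0.71
print(round(f1_score(precision, recall), 3))  # 0.408, the first value in the log
```

A precision of about 0.29 means that roughly seven out of ten boxes kept at the metric's score threshold did not match a ground-truth box.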

Google Colab Link for the code

Link

Dataset Link

Link

Weights Link

Link

Versions

No response

MSSRPRAD, Jul 11 '23 13:07

Hi @MSSRPRAD, I am not sure why. Most likely there is no prediction for the input image; alternatively, there could be an issue with drawing on your image. Could you please iterate over the predictions and check:

  1. Whether there is any prediction
  2. If so, whether the predictions look weird (coordinates outside the image, for instance)

If you're not sure how to, have a look at this page
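The two checks above can be sketched as a small helper. This is a minimal sketch, not the library's API: it takes plain `[x1, y1, x2, y2]` lists, because how the boxes are pulled out of the object returned by `model.predict` depends on the super-gradients version and is not shown here. `check_boxes` and the sample boxes are hypothetical.

```python
from typing import Dict, List, Sequence


def check_boxes(
    boxes_xyxy: Sequence[Sequence[float]],
    image_width: int,
    image_height: int,
) -> Dict[str, List[int]]:
    """Sort box indices into 'ok', 'out_of_image' and 'degenerate' buckets."""
    report: Dict[str, List[int]] = {"ok": [], "out_of_image": [], "degenerate": []}
    for index, (x1, y1, x2, y2) in enumerate(boxes_xyxy):
        if x2 <= x1 or y2 <= y1:
            # Zero or negative width/height: nothing visible would be drawn.
            report["degenerate"].append(index)
        elif x1 < 0 or y1 < 0 or x2 > image_width or y2 > image_height:
            report["out_of_image"].append(index)
        else:
            report["ok"].append(index)
    return report


# Hypothetical boxes for a 640x480 image, for illustration only.
boxes = [
    [10.0, 20.0, 110.0, 220.0],    # fully inside the image
    [600.0, 400.0, 700.0, 500.0],  # extends past the right and bottom edges
    [50.0, 50.0, 50.0, 80.0],      # zero width
]
report = check_boxes(boxes, image_width=640, image_height=480)
print(f"{len(boxes)} boxes: {report}")
```

An empty input answers the first question (no predictions at all), while a non-empty `out_of_image` or `degenerate` bucket points at the second.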

Let me know what you get, hoping this helps

Louis-Dupont, Jul 13 '23 12:07

@Louis-Dupont

  1. If there is any prediction

[image]

Maybe this is an issue with the training. I am sharing some training logs and the code.

CONFIG:

--------- config parameters ----------
{
    "arch_params": {
        "schema": null
    },
    "checkpoint_params": {
        "load_checkpoint": false,
        "schema": null
    },
    "training_hyperparams": {
        "lr_warmup_epochs": 3,
        "lr_warmup_steps": 0,
        "lr_cooldown_epochs": 0,
        "warmup_initial_lr": 1e-06,
        "cosine_final_lr_ratio": 0.1,
        "optimizer": "Adam",
        "optimizer_params": {
            "weight_decay": 0.0001
        },
        "criterion_params": {},
        "ema": false,
        "batch_accumulate": 1,
        "ema_params": {
            "decay": 0.9,
            "decay_type": "threshold"
        },
        "zero_weight_decay_on_bias_and_bn": true,
        "load_opt_params": true,
        "run_validation_freq": 1,
        "save_model": true,
        "metric_to_watch": "mAP@0.50",
        "launch_tensorboard": false,
        "tb_files_user_prompt": false,
        "silent_mode": false,
        "mixed_precision": true,
        "tensorboard_port": null,
        "save_ckpt_epoch_list": [],
        "average_best_models": true,
        "dataset_statistics": false,
        "save_tensorboard_to_s3": false,
        "lr_schedule_function": null,
        "train_metrics_list": [],
        "valid_metrics_list": [
            "DetectionMetrics_050(\n  (post_prediction_callback): PPYoloEPostPredictionCallback()\n)"
        ],
        "greater_metric_to_watch_is_better": true,
        "precise_bn": false,
        "precise_bn_batch_size": null,
        "seed": 42,
        "lr_mode": "cosine",
        "phase_callbacks": null,
        "log_installed_packages": true,
        "sg_logger": "base_sg_logger",
        "sg_logger_params": {
            "tb_files_user_prompt": false,
            "project_name": "",
            "launch_tensorboard": false,
            "tensorboard_port": null,
            "save_checkpoints_remote": false,
            "save_tensorboard_remote": false,
            "save_logs_remote": false
        },
        "warmup_mode": "linear_epoch_step",
        "step_lr_update_freq": null,
        "lr_updates": [],
        "clip_grad_norm": null,
        "pre_prediction_callback": null,
        "ckpt_best_name": "ckpt_best.pth",
        "enable_qat": false,
        "resume": false,
        "resume_path": null,
        "ckpt_name": "ckpt_latest.pth",
        "resume_strict_load": false,
        "sync_bn": false,
        "kill_ddp_pgroup_on_end": true,
        "max_train_batches": null,
        "max_valid_batches": null,
        "resume_from_remote_sg_logger": false,
        "schema": {
            "type": "object",
            "properties": {
                "max_epochs": {
                    "type": "number",
                    "minimum": 1
                },
                "lr_decay_factor": {
                    "type": "number",
                    "minimum": 0,
                    "maximum": 1
                },
                "lr_warmup_epochs": {
                    "type": "number",
                    "minimum": 0,
                    "maximum": 10
                },
                "initial_lr": {
                    "type": "number",
                    "exclusiveMinimum": 0,
                    "maximum": 10
                }
            },
            "if": {
                "properties": {
                    "lr_mode": {
                        "const": "step"
                    }
                }
            },
            "then": {
                "required": [
                    "lr_updates",
                    "lr_decay_factor"
                ]
            },
            "required": [
                "max_epochs",
                "lr_mode",
                "initial_lr",
                "loss"
            ]
        },
        "initial_lr": 0.0005,
        "eyma": true,
        "max_epochs": 300,
        "loss": "PPYoloELoss(\n  (static_assigner): ATSSAssigner()\n  (assigner): TaskAlignedAssigner()\n)"
    },
    "dataset_params": {
        "train_dataset_params": "{'data_dir': './SKU-110K/converted', 'images_dir': 'train/images', 'labels_dir': 'train/labelsn', 'classes': ['0'], 'input_dim': [640, 640], 'cache_dir': None, 'cache': False, 'transforms': [{'DetectionMosaic': {'input_dim': [640, 640], 'prob': 1.0}}, {'DetectionRandomAffine': {'degrees': 10.0, 'translate': 0.1, 'scales': [0.1, 2], 'shear': 2.0, 'target_size': [640, 640], 'filter_box_candidates': True, 'wh_thr': 2, 'area_thr': 0.1, 'ar_thr': 20}}, {'DetectionMixup': {'input_dim': [640, 640], 'mixup_scale': [0.5, 1.5], 'prob': 1.0, 'flip_prob': 0.5}}, {'DetectionHSV': {'prob': 1.0, 'hgain': 5, 'sgain': 30, 'vgain': 30}}, {'DetectionHorizontalFlip': {'prob': 0.5}}, {'DetectionPaddedRescale': {'input_dim': [640, 640], 'max_targets': 120}}, {'DetectionTargetsFormatTransform': {'input_dim': [640, 640], 'output_format': 'LABEL_CXCYWH'}}], 'class_inclusion_list': None, 'max_num_samples': None}",
        "train_dataloader_params": {
            "batch_size": 4,
            "num_workers": 8,
            "shuffle": true,
            "drop_last": true,
            "pin_memory": true,
            "collate_fn": "<super_gradients.training.utils.detection_utils.DetectionCollateFN object at 0x7f2de929c9d0>"
        },
        "valid_dataset_params": "{'data_dir': './SKU-110K/converted', 'images_dir': 'val/images', 'labels_dir': 'val/labelsn', 'classes': ['0'], 'input_dim': [640, 640], 'cache_dir': None, 'cache': False, 'transforms': [{'DetectionPaddedRescale': {'input_dim': [640, 640]}}, {'DetectionTargetsFormatTransform': {'max_targets': 50, 'input_dim': [640, 640], 'output_format': 'LABEL_CXCYWH'}}], 'class_inclusion_list': None, 'max_num_samples': None}",
        "valid_dataloader_params": {
            "batch_size": 4,
            "num_workers": 8,
            "shuffle": true,
            "drop_last": true,
            "pin_memory": true,
            "collate_fn": "<super_gradients.training.utils.detection_utils.DetectionCollateFN object at 0x7f2de929c9d0>"
        },
        "schema": null
    },

CODE:

model = models.get(config.MODEL_NAME, 
                   num_classes=config.NUM_CLASSES
                   )
load_checkpoint_to_model(net=model, ckpt_local_path="/user/yolonas/SKU-110K/checkpoints/FirstRun/ckpt_latest.pth")

First I trained the model for 50 epochs, then loaded that run's checkpoint and trained for a further 300+ epochs (it got interrupted in the middle, at the 105th epoch, but that shouldn't make a difference...).

TRAIN LOGS (50 Epochs)

===========================================================
SUMMARY OF EPOCH 50
├── Training
│   ├── Ppyoloeloss/loss = 1.7945
│   │   ├── Best until now = 1.7942 (↗ 0.0003)
│   │   └── Epoch N-1      = 1.7942 (↗ 0.0003)
│   ├── Ppyoloeloss/loss_cls = 0.8438
│   │   ├── Best until now = 0.6648 (↗ 0.179)
│   │   └── Epoch N-1      = 0.8429 (↗ 0.0009)
│   ├── Ppyoloeloss/loss_dfl = 0.8001
│   │   ├── Best until now = 0.8002 (↘ -1e-04)
│   │   └── Epoch N-1      = 0.8002 (↘ -1e-04)
│   └── Ppyoloeloss/loss_iou = 0.2203
│       ├── Best until now = 0.2205 (↘ -0.0002)
│       └── Epoch N-1      = 0.2205 (↘ -0.0002)
└── Validation
    ├── F1@0.50 = 0.408
    │   ├── Best until now = 0.4062 (↗ 0.0018)
    │   └── Epoch N-1      = 0.3985 (↗ 0.0095)
    ├── Map@0.50 = 0.265
    │   ├── Best until now = 0.2612 (↗ 0.0038)
    │   └── Epoch N-1      = 0.2401 (↗ 0.0249)
    ├── Ppyoloeloss/loss = 1.8841
    │   ├── Best until now = 1.8717 (↗ 0.0124)
    │   └── Epoch N-1      = 1.8993 (↘ -0.0152)
    ├── Ppyoloeloss/loss_cls = 1.0272
    │   ├── Best until now = 0.9789 (↗ 0.0483)
    │   └── Epoch N-1      = 1.0314 (↘ -0.0042)
    ├── Ppyoloeloss/loss_dfl = 0.7045
    │   ├── Best until now = 0.7038 (↗ 0.0006)
    │   └── Epoch N-1      = 0.7123 (↘ -0.0078)
    ├── Ppyoloeloss/loss_iou = 0.2019
    │   ├── Best until now = 0.1994 (↗ 0.0025)
    │   └── Epoch N-1      = 0.2047 (↘ -0.0028)
    ├── Precision@0.50 = 0.2769
    │   ├── Best until now = 0.2761 (↗ 0.0008)
    │   └── Epoch N-1      = 0.272  (↗ 0.0049)
    └── Recall@0.50 = 0.7747
        ├── Best until now = 0.7737 (↗ 0.001)
        └── Epoch N-1      = 0.7447 (↗ 0.03)

===========================================================
[2023-06-18 15:34:35] INFO - base_sg_logger.py - [CLEANUP] - Successfully stopped system monitoring process
------------------------------Finished Training-------------------------------------------

TRAINING LOGS (100+ Epochs)

===========================================================
SUMMARY OF EPOCH 0
├── Training
│   ├── Ppyoloeloss/loss = 1.785
│   ├── Ppyoloeloss/loss_cls = 0.8411
│   ├── Ppyoloeloss/loss_dfl = 0.7957
│   └── Ppyoloeloss/loss_iou = 0.2184
└── Validation
    ├── F1@0.50 = 0.402
    ├── Map@0.50 = 0.2514
    ├── Ppyoloeloss/loss = 1.8759
    ├── Ppyoloeloss/loss_cls = 1.0222
    ├── Ppyoloeloss/loss_dfl = 0.7035
    ├── Ppyoloeloss/loss_iou = 0.2008
    ├── Precision@0.50 = 0.2726
    └── Recall@0.50 = 0.7649

===========================================================
===========================================================
SUMMARY OF EPOCH 104
├── Training
│   ├── Ppyoloeloss/loss = 1.836
│   │   ├── Best until now = 1.785  (↗ 0.051)
│   │   └── Epoch N-1      = 1.8394 (↘ -0.0034)
│   ├── Ppyoloeloss/loss_cls = 0.8558
│   │   ├── Best until now = 0.8411 (↗ 0.0147)
│   │   └── Epoch N-1      = 0.8555 (↗ 0.0003)
│   ├── Ppyoloeloss/loss_dfl = 0.8234
│   │   ├── Best until now = 0.7957 (↗ 0.0278)
│   │   └── Epoch N-1      = 0.8273 (↘ -0.0039)
│   └── Ppyoloeloss/loss_iou = 0.2274
│       ├── Best until now = 0.2184 (↗ 0.009)
│       └── Epoch N-1      = 0.2281 (↘ -0.0007)
└── Validation
    ├── F1@0.50 = 0.3691
    │   ├── Best until now = 0.402  (↘ -0.0329)
    │   └── Epoch N-1      = 0.3829 (↘ -0.0138)
    ├── Map@0.50 = 0.2105
    │   ├── Best until now = 0.2517 (↘ -0.0412)
    │   └── Epoch N-1      = 0.2303 (↘ -0.0198)
    ├── Ppyoloeloss/loss = 1.9345
    │   ├── Best until now = 1.8759 (↗ 0.0586)
    │   └── Epoch N-1      = 1.9501 (↘ -0.0156)
    ├── Ppyoloeloss/loss_cls = 1.0391
    │   ├── Best until now = 1.0148 (↗ 0.0243)
    │   └── Epoch N-1      = 1.0539 (↘ -0.0149)
    ├── Ppyoloeloss/loss_dfl = 0.7365
    │   ├── Best until now = 0.7035 (↗ 0.033)
    │   └── Epoch N-1      = 0.7315 (↗ 0.005)
    ├── Ppyoloeloss/loss_iou = 0.2109
    │   ├── Best until now = 0.2008 (↗ 0.0101)
    │   └── Epoch N-1      = 0.2122 (↘ -0.0013)
    ├── Precision@0.50 = 0.2476
    │   ├── Best until now = 0.2744 (↘ -0.0268)
    │   └── Epoch N-1      = 0.2632 (↘ -0.0156)
    └── Recall@0.50 = 0.7242
        ├── Best until now = 0.7649 (↘ -0.0407)
        └── Epoch N-1      = 0.7023 (↗ 0.0219)

===========================================================

MSSRPRAD, Jul 14 '23 09:07

I had the same problem, and it was solved by changing my CUDA version from 11.8 to 10.2, but I do not know why. @BloodAxe @Louis-Dupont @MSSRPRAD

SabraHashemi, Dec 18 '23 10:12

Hello @MSSRPRAD, were you able to resolve this issue? I am facing a similar bug and my case is just like yours, so please advise. Thanks.

prernabhadwal, Jun 10 '24 04:06

Model_Training.txt

Attached is my model training file. I have ~8000 images and trained my model for 20 epochs as per the attached training file. Due to memory issues, training was interrupted partway through, but I resumed it from the epoch stored in the checkpoint file. But when I test an image, it simply displays the image without any bounding boxes. Please help.

prernabhadwal, Jun 10 '24 06:06

Hello @MSSRPRAD, were you able to resolve this issue? I am facing a similar bug and my case is just like yours, so please advise. Thanks.

Hello @prernabhadwal. I was not able to figure it out then. I have not worked on this for almost a year, so I might not be able to help.

MSSRPRAD, Jun 10 '24 06:06

I used to encounter the same problem, but after I removed set_dataset_processing_params, the problem was solved.

BTW, if you encounter the error message below after you remove set_dataset_processing_params, you can review issue #1739 for more information.

RuntimeError: You must set the dataset processing parameters before calling predict.
Please call model.set_dataset_processing_params(...) first.
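One hypothesis that would be consistent with this observation (it is not confirmed anywhere in this thread) is a mismatch between the preprocessing used during training and the `image_processor` passed to `set_dataset_processing_params`. The validation transforms in the config posted earlier list only `DetectionPaddedRescale` and `DetectionTargetsFormatTransform`, while the inference pipeline in the original post adds `StandardizeImage(max_value=255.0)`. The sketch below is plain Python, not super-gradients code: `standardize` is a stand-in that assumes standardization means dividing by `max_value`, and the pixel values are made up.

```python
from typing import List, Tuple


def standardize(pixels: List[float], max_value: float = 255.0) -> List[float]:
    """Stand-in for a standardization step: divide every value by max_value."""
    return [p / max_value for p in pixels]


def value_range(pixels: List[float]) -> Tuple[float, float]:
    """Smallest and largest value, i.e. the range the network would see."""
    return min(pixels), max(pixels)


# Hypothetical pixel values; 114 is the pad value used in the inference code.
raw_pixels = [0.0, 114.0, 255.0]
train_like = raw_pixels               # assumed: no standardization during training
infer_like = standardize(raw_pixels)  # assumed: standardization at inference

print("training-like range:", value_range(train_like))
print("inference-like range:", value_range(infer_like))
```

If the two ranges really differ by a factor of 255 in a given setup, the network receives inputs unlike anything it saw during training, which could plausibly push every confidence below the drawing threshold. Comparing the actual training transforms with the inference processors is the way to confirm or rule this out.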

charlescwwang, Jun 27 '24 03:06