
Facing an issue while fine-tuning the RF-DETR model for single-class object detection

Open PRIYANKAMANN opened this issue 2 months ago • 4 comments

import multiprocessing as mp

# ================================================================================
#                             -- FIX FOR WINDOWS --
# This prevents the "PermissionError: [WinError 5] Access is denied"
# by correctly handling multiprocessing on Windows systems.
# ================================================================================
if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)

from rfdetr import RFDETRMedium
import os

# ================================================================================
#                             -- STEP 1: INITIALIZE MODEL --
# This builds the model's architecture with the correct 2-class prediction head.
# The 'num_classes' is for one object class ("cell") and one background class.
# The library will handle the mismatch between the pre-trained model (90 classes)
# and your dataset (1 class) gracefully.
# ================================================================================
print("Initializing RFDETR-Medium model for cell detection...")
num_your_classes = 2
model = RFDETRMedium(num_classes=num_your_classes)
print("Model initialized successfully.")

# ================================================================================
#                             -- STEP 2: START TRAINING --
# We now pass all the necessary training parameters. Note the inclusion of
# 'num_workers=0', which is crucial for stability on Windows to avoid
# the multiprocessing PermissionError.
# ================================================================================
print("\nStarting training...")
try:
    model.train(
        dataset_dir=r"C:\Users\DATASET",
        epochs=111,
        batch_size=8,
        grad_accum_steps=2,
        num_workers=0,  # CRITICAL: For Windows stability
        resolution=576,
        rect=True
    )
    print("Training finished.")
except Exception as e:
    print(f"An error occurred during training: {e}")                         

For the above training script I am getting this error:

Initializing RFDETR-Medium model for cell detection...
Using a different number of positional encodings than DINOv2, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Using patch size 16 instead of 14, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Loading pretrain weights
num_classes mismatch: pretrain weights has 90 classes, but your model has 2 classes
reinitializing detection head with 90 classes

RuntimeError                              Traceback (most recent call last)
Cell In[8], line 23
     21 print("Initializing RFDETR-Medium model for cell detection...")
     22 num_your_classes = 2
---> 23 model = RFDETRMedium(num_classes=num_your_classes)
     24 print("Model initialized successfully.")
     26 # ================================================================================
     27 #                             -- STEP 2: START TRAINING --
     28 # We now pass all the necessary training parameters. Note the inclusion of
     29 # 'num_workers=0', which is crucial for stability on Windows to avoid
     30 # the multiprocessing PermissionError.
     31 # ================================================================================

File ~\DATASET\rf-detr\rfdetr\detr.py:53, in RFDETR.__init__(self, **kwargs)
     51 self.model_config = self.get_model_config(**kwargs)
     52 self.maybe_download_pretrain_weights()
---> 53 self.model = self.get_model(self.model_config)
     54 self.callbacks = defaultdict(list)
     56 self.model.inference_model = None

File ~\DATASET\rf-detr\rfdetr\detr.py:201, in RFDETR.get_model(self, config)
    197 def get_model(self, config: ModelConfig):
    198     """
    199     Retrieve a model instance based on the provided configuration.
...
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I have tried a lot to get this fine-tuning working.

PRIYANKAMANN avatar Sep 09 '25 05:09 PRIYANKAMANN
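
A side note on the class count: the later run in this thread that actually trains (Cell In[2] below) constructs the model with no num_classes argument and lets the library reinitialize the detection head from the dataset ("num_classes mismatch: model has 90 classes, but your dataset has 1 classes / reinitializing your detection head with 1 classes"). A minimal sketch of that approach:

from rfdetr import RFDETRMedium

# Let the library infer the class count from the dataset at train time;
# the log later in this thread shows it reinitializing the head with
# 1 class for the single 'cell' category.
model = RFDETRMedium()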

Can you run with CUDA_LAUNCH_BLOCKING and post the result?

isaacrob-roboflow avatar Sep 09 '25 14:09 isaacrob-roboflow
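
For reference, CUDA_LAUNCH_BLOCKING has to be set before CUDA is initialized, i.e. before torch is imported. A minimal sketch:

import os

# Make CUDA kernel launches synchronous so the traceback points at the op
# that actually triggered the device-side assert.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch (and rfdetr) only after the variable is set

On Windows it can also be set in the shell with "set CUDA_LAUNCH_BLOCKING=1" before launching Python.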

Yes, I have tried with CUDA_LAUNCH_BLOCKING=1. The main concern here is the num_classes mismatch: is it possible to fine-tune this model on a custom dataset with a smaller number of classes, and if so, how should I proceed? Can you suggest anything? Also, I am working in VS Code rather than Colab; could that be a reason?

Initializing RFDETR-Medium model for cell detection...
rf-detr-medium.pth: 100%|██████████| 386M/386M [00:24<00:00, 16.4MiB/s]
Using a different number of positional encodings than DINOv2, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Using patch size 16 instead of 14, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Loading pretrain weights
num_classes mismatch: pretrain weights has 90 classes, but your model has 2 classes
reinitializing detection head with 90 classes

RuntimeError                              Traceback (most recent call last)
Cell In[13], line 25
     21 num_your_classes = 2
     23 # FIX: Pass the class names directly to the model to ensure a correct mapping.
     24 # This helps avoid the mismatch that likely caused the CUDA error.
---> 25 model = RFDETRMedium(num_classes=num_your_classes, classes=['cell', 'background'])
     26 print("Model initialized successfully.")
     28 # ================================================================================
     29 #                             -- STEP 2: START TRAINING --
     30 # Pass all the necessary training parameters.
     31 # 'num_workers=0' is crucial for stability on Windows.
     32 # ================================================================================

File ~\OneDrive - DATASET\Documents\Priyanka\rf-detr\rfdetr\detr.py:53, in RFDETR.__init__(self, **kwargs)
     51 self.model_config = self.get_model_config(**kwargs)
     52 self.maybe_download_pretrain_weights()
---> 53 self.model = self.get_model(self.model_config)
     54 self.callbacks = defaultdict(list)
     56 self.model.inference_model = None

File ~\OneDrive - DATASET\Priyanka\rf-detr\rfdetr\detr.py:201, in RFDETR.get_model(self, config)
    197 def get_model(self, config: ModelConfig):
    198     """
    199     Retrieve a model instance based on the provided configuration.
...
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

PRIYANKAMANN avatar Sep 09 '25 16:09 PRIYANKAMANN
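
A device-side assert raised from the classification loss is commonly caused by category ids in the annotations that are out of range for the model's class count. A sketch for checking this, assuming the Roboflow/COCO layout implied by the training logs in this thread (one _annotations.coco.json per split; the path is illustrative):

import json
from pathlib import Path

dataset_dir = Path(r"C:\Users\DATASET")  # illustrative path

for split in ("train", "valid", "test"):
    ann_file = dataset_dir / split / "_annotations.coco.json"
    if not ann_file.exists():
        continue
    data = json.loads(ann_file.read_text())
    cat_ids = sorted(c["id"] for c in data["categories"])
    used_ids = sorted({a["category_id"] for a in data["annotations"]})
    print(f"{split}: category ids {cat_ids}, ids used in annotations {used_ids}")
    # If any id here is out of range for the num_classes given to the model,
    # the loss can index the classification head out of range on the GPU,
    # which surfaces as "CUDA error: device-side assert triggered".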

I got it working, but now I am encountering another issue: can this model be used on rectangular images, and if not, does zero-padding or another form of resizing help?

PRIYANKAMANN avatar Sep 10 '25 16:09 PRIYANKAMANN
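
On the rectangular-image question: one generic way to test the zero-padding idea is to letterbox each image onto a square canvas and shift the box annotations by the padding offsets. A sketch, not RF-DETR's own preprocessing, assuming COCO-style [x, y, w, h] boxes:

from PIL import Image

def letterbox_to_square(path, fill=(0, 0, 0)):
    """Pad a rectangular image to a centered square canvas.

    Returns the padded image and the (x, y) offsets that must be added to
    every box annotation belonging to that image.
    """
    img = Image.open(path).convert("RGB")
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), fill)
    x_off = (side - img.width) // 2
    y_off = (side - img.height) // 2
    canvas.paste(img, (x_off, y_off))
    return canvas, x_off, y_off

# A COCO box [x, y, w, h] on the original image becomes
# [x + x_off, y + y_off, w, h] on the padded image.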

To give context: after running a single epoch on my rectangular dataset, I got the error below. The log:

Using a different number of positional encodings than DINOv2, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Using patch size 16 instead of 14, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Loading pretrain weights
num_classes mismatch: model has 90 classes, but your dataset has 1 classes
reinitializing your detection head with 1 classes.
TensorBoard logging initialized. To monitor logs, use 'tensorboard --logdir output' and open http://localhost:6006/ in browser.
Not using distributed mode
git: sha: N/A, status: clean, branch: N/A

Namespace(num_classes=1, grad_accum_steps=2, amp=True, lr=0.0001, lr_encoder=0.00015, batch_size=8, weight_decay=0.0001, epochs=10, lr_drop=100, clip_max_norm=0.1, lr_vit_layer_decay=0.8, lr_component_decay=0.7, do_benchmark=False, dropout=0, drop_path=0.0, drop_mode='standard', drop_schedule='constant', cutoff_epoch=0, pretrained_encoder=None, pretrain_weights='rf-detr-medium.pth', pretrain_exclude_keys=None, pretrain_keys_modify_to_load=None, pretrained_distiller=None, encoder='dinov2_windowed_small', vit_encoder_num_layers=12, window_block_indexes=None, position_embedding='sine', out_feature_indexes=[3, 6, 9, 12], freeze_encoder=False, layer_norm=True, rms_norm=False, backbone_lora=False, force_no_pretrain=False, dec_layers=4, dim_feedforward=2048, hidden_dim=256, sa_nheads=8, ca_nheads=16, num_queries=300, group_detr=13, two_stage=True, projector_scale=['P4'], lite_refpoint_refine=True, num_select=300, dec_n_points=2, decoder_norm='LN', bbox_reparam=True, freeze_batch_norm=False, set_cost_class=2, set_cost_bbox=5, set_cost_giou=2, cls_loss_coef=1.0, bbox_loss_coef=5, giou_loss_coef=2, focal_alpha=0.25, aux_loss=True, sum_group_losses=False, use_varifocal_loss=False, use_position_supervised_loss=False, ia_bce_loss=True, dataset_file='roboflow', coco_path=None, dataset_dir='C:\Users\PMann\\Documents\Priyanka\dino_dataset\rfdetr_new_dataset', square_resize_div_64=True, output_dir='output', dont_save_weights=False, checkpoint_interval=10, seed=42, resume='', start_epoch=0, eval=False, use_ema=True, ema_decay=0.993, ema_tau=100, num_workers=2, device='cuda', world_size=1, dist_url='env://', sync_bn=True, fp16_eval=False, encoder_only=False, backbone_only=False, resolution=576, use_cls_token=False, multi_scale=True, expanded_scales=True, do_random_resize_via_padding=False, warmup_epochs=0, lr_scheduler='step', lr_min_factor=0.0, early_stopping=False, early_stopping_patience=10, early_stopping_min_delta=0.001, early_stopping_use_ema=False, gradient_checkpointing=False, patch_size=16, num_windows=2, positional_encoding_size=36, tensorboard=True, wandb=False, project=None, run=None, class_names=['cell'], run_test=True, distributed=False) number of params: 33363638 [736] loading annotations into memory... Done (t=0.02s) creating index... index created! [736] loading annotations into memory... Done (t=0.02s) creating index... index created! [736] loading annotations into memory... Done (t=0.02s) creating index... index created! Get benchmark Start training Grad accum steps: 2 Total batch size: 16 LENGTH OF DATA LOADER: 125 UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\TensorShape.cpp:3596.)

Epoch: [0] [ 0/125] eta: 1:21:28 lr: 0.000100 class_error: 0.00 loss: 17.7582 (17.7582) loss_ce: 0.0834 (0.0834) loss_bbox: 1.0564 (1.0564) loss_giou: 2.1016 (2.1016) loss_ce_0: 0.0711 (0.0711) loss_bbox_0: 1.6500 (1.6500) loss_giou_0: 2.1080 (2.1080) loss_ce_1: 0.0738 (0.0738) loss_bbox_1: 1.3171 (1.3171) loss_giou_1: 2.0839 (2.0839) loss_ce_2: 0.0839 (0.0839) loss_bbox_2: 1.1612 (1.1612) loss_giou_2: 2.0870 (2.0870) loss_ce_enc: 0.0695 (0.0695) loss_bbox_enc: 1.7164 (1.7164) loss_giou_enc: 2.0948 (2.0948) loss_ce_unscaled: 0.0834 (0.0834) class_error_unscaled: 0.0000 (0.0000) loss_bbox_unscaled: 0.2113 (0.2113) loss_giou_unscaled: 1.0508 (1.0508) cardinality_error_unscaled: 1.1250 (1.1250) loss_ce_0_unscaled: 0.0711 (0.0711) loss_bbox_0_unscaled: 0.3300 (0.3300) loss_giou_0_unscaled: 1.0540 (1.0540) cardinality_error_0_unscaled: 1.1250 (1.1250) loss_ce_1_unscaled: 0.0738 (0.0738) loss_bbox_1_unscaled: 0.2634 (0.2634) loss_giou_1_unscaled: 1.0419 (1.0419) cardinality_error_1_unscaled: 1.1250 (1.1250) loss_ce_2_unscaled: 0.0839 (0.0839) loss_bbox_2_unscaled: 0.2322 (0.2322) loss_giou_2_unscaled: 1.0435 (1.0435) cardinality_error_2_unscaled: 1.1250 (1.1250) loss_ce_enc_unscaled: 0.0695 (0.0695) loss_bbox_enc_unscaled: 0.3433 (0.3433) loss_giou_enc_unscaled: 1.0474 (1.0474) cardinality_error_enc_unscaled: 1.1250 (1.1250) time: 39.1085 data: 36.8820 max mem: 6468-------------------------------

Epoch: [0] [120/125] eta: 0:00:05 lr: 0.000100 class_error: 0.00 loss: 10.2071 (10.5338) loss_ce: 0.7563 (0.7639) loss_bbox: 0.0381 (0.0556) loss_giou: 1.2399 (1.2610) loss_ce_0: 0.7521 (0.7290) loss_bbox_0: 0.0386 (0.0670) loss_giou_0: 1.2309 (1.3295) loss_ce_1: 0.7705 (0.7622) loss_bbox_1: 0.0366 (0.0616) loss_giou_1: 1.2273 (1.2697) loss_ce_2: 0.7600 (0.7569) loss_bbox_2: 0.0378 (0.0578) loss_giou_2: 1.2508 (1.2734) loss_ce_enc: 0.7201 (0.6775) loss_bbox_enc: 0.0444 (0.0779) loss_giou_enc: 1.3034 (1.3908) loss_ce_unscaled: 0.7563 (0.7639) class_error_unscaled: 0.0000 (0.0000) loss_bbox_unscaled: 0.0076 (0.0111) loss_giou_unscaled: 0.6200 (0.6305) cardinality_error_unscaled: 1.1250 (1.1116) loss_ce_0_unscaled: 0.7521 (0.7290) loss_bbox_0_unscaled: 0.0077 (0.0134) loss_giou_0_unscaled: 0.6155 (0.6647) cardinality_error_0_unscaled: 1.1250 (1.1116) loss_ce_1_unscaled: 0.7705 (0.7622) loss_bbox_1_unscaled: 0.0073 (0.0123) loss_giou_1_unscaled: 0.6136 (0.6348) cardinality_error_1_unscaled: 1.1250 (1.1116) loss_ce_2_unscaled: 0.7600 (0.7569) loss_bbox_2_unscaled: 0.0076 (0.0116) loss_giou_2_unscaled: 0.6254 (0.6367) cardinality_error_2_unscaled: 1.1250 (1.1116) loss_ce_enc_unscaled: 0.7201 (0.6775) loss_bbox_enc_unscaled: 0.0089 (0.0156) loss_giou_enc_unscaled: 0.6517 (0.6954) cardinality_error_enc_unscaled: 1.1250 (1.1116) time: 0.8294 data: 0.0031 max mem: 8283

Epoch: [0] [124/125] eta: 0:00:01 lr: 0.000100 class_error: -0.00 loss: 10.5187 (10.5388) loss_ce: 0.7563 (0.7642) loss_bbox: 0.0367 (0.0550) loss_giou: 1.2472 (1.2630) loss_ce_0: 0.7359 (0.7285) loss_bbox_0: 0.0385 (0.0660) loss_giou_0: 1.3323 (1.3316) loss_ce_1: 0.7608 (0.7618) loss_bbox_1: 0.0355 (0.0607) loss_giou_1: 1.2524 (1.2719) loss_ce_2: 0.7600 (0.7573) loss_bbox_2: 0.0372 (0.0571) loss_giou_2: 1.3249 (1.2751) loss_ce_enc: 0.6874 (0.6779) loss_bbox_enc: 0.0440 (0.0767) loss_giou_enc: 1.3568 (1.3920) loss_ce_unscaled: 0.7563 (0.7642) class_error_unscaled: 0.0000 (-0.0000) loss_bbox_unscaled: 0.0073 (0.0110) loss_giou_unscaled: 0.6236 (0.6315) cardinality_error_unscaled: 1.1250 (1.1090) loss_ce_0_unscaled: 0.7359 (0.7285) loss_bbox_0_unscaled: 0.0077 (0.0132) loss_giou_0_unscaled: 0.6662 (0.6658) cardinality_error_0_unscaled: 1.1250 (1.1090) loss_ce_1_unscaled: 0.7608 (0.7618) loss_bbox_1_unscaled: 0.0071 (0.0121) loss_giou_1_unscaled: 0.6262 (0.6360) cardinality_error_1_unscaled: 1.1250 (1.1090) loss_ce_2_unscaled: 0.7600 (0.7573) loss_bbox_2_unscaled: 0.0074 (0.0114) loss_giou_2_unscaled: 0.6624 (0.6376) cardinality_error_2_unscaled: 1.1250 (1.1090) loss_ce_enc_unscaled: 0.6874 (0.6779) loss_bbox_enc_unscaled: 0.0088 (0.0153) loss_giou_enc_unscaled: 0.6784 (0.6960) cardinality_error_enc_unscaled: 1.1250 (1.1090) time: 0.8216 data: 0.0031 max mem: 8283 Epoch: [0] Total time: 0:02:24 (1.1527 s / it)

TypeError                                 Traceback (most recent call last)
Cell In[2], line 5
      1 from rfdetr import RFDETRMedium
      3 model = RFDETRMedium()
----> 5 model.train(dataset_dir=r"C:\Users\PMann\OneDrive - University of Maryland School of Medicine\Documents\Priyanka\dino_dataset\rfdetr_new_dataset", epochs=10, batch_size=8, grad_accum_steps=2)

File ~\AppData\Roaming\Python\Python312\site-packages\rfdetr\detr.py:81, in RFDETR.train(self, **kwargs)
     77 """
     78 Train an RF-DETR model.
     79 """
     80 config = self.get_train_config(**kwargs)
---> 81 self.train_from_config(config, **kwargs)

File ~\AppData\Roaming\Python\Python312\site-packages\rfdetr\detr.py:186, in RFDETR.train_from_config(self, config, **kwargs)
    178 early_stopping_callback = EarlyStoppingCallback(
    179     model=self.model,
    180     patience=config.early_stopping_patience,
    181     min_delta=config.early_stopping_min_delta,
    182     use_ema=config.early_stopping_use_ema
    183 )
    184 self.callbacks["on_fit_epoch_end"].append(early_stopping_callback.update)
--> 186 self.model.train(
    187     **all_kwargs,
    188     callbacks=self.callbacks,
...
    129 raise ValueError(
    130     "Number of samples, %s, must be non-negative." % num
    131 )

TypeError: 'numpy.float64' object cannot be interpreted as an integer

Even after making my dataset square by adding zero padding, I am getting the same error.

PRIYANKAMANN avatar Sep 10 '25 17:09 PRIYANKAMANN
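
The truncated frames above end in np.linspace's "Number of samples, %s, must be non-negative" source line, which points at a well-known incompatibility between older pycocotools builds and NumPy >= 1.18 (np.linspace stopped accepting a float for its num argument), rather than at anything related to square vs. rectangular images. A sketch reproducing that failure mode and the cast that avoids it:

import numpy as np

# Older pycocotools computes the IoU threshold count like this; np.round
# returns a numpy.float64, which modern NumPy rejects as a sample count.
n = np.round((0.95 - 0.5) / 0.05) + 1  # -> 10.0 (numpy.float64)

try:
    np.linspace(0.5, 0.95, n)  # raises on NumPy >= 1.18
except TypeError as e:
    print(e)  # 'numpy.float64' object cannot be interpreted as an integer

print(np.linspace(0.5, 0.95, int(n)))  # an explicit int cast works

If that is indeed the cause, upgrading pycocotools (recent releases cast the count to an integer internally) is usually enough.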