RFDETRSegPreview ignores checkpoint args.resolution and keeps default 432 in model_config.resolution after retraining
Search before asking
- [x] I have searched the RF-DETR issues and found no similar bug report.
Bug
Hi
When I retrain RF-DETR Segmentation Preview with a custom resolution (e.g. 576), the checkpoint correctly stores args.resolution = 576, but when I later load that same checkpoint with RFDETRSegPreview(pretrain_weights=...), the model’s config still says 432.
This becomes a problem when calling .export() because the exported ONNX then “thinks” the model was trained at 432, so downstream tools (X-AnyLabeling in my case) also expect 432×432, not 576×576.
So the checkpoint and the loaded model contradict each other.
My actual goal is to export a custom RF-DETR Segmentation model to ONNX to use in X-AnyLabeling.
Steps to reproduce are below.
Also, I don't think I can export the model at resolution 576: even setting rfdetr_model.model_config.resolution = 576 manually still results in an exported model that expects a 432x432 input.
Below is a code snippet and a screenshot from https://netron.app/
```python
import torch
from rfdetr import RFDETRSegPreview

checkpoint = "./models/RFDETRSegPreviewV2.pth"
device = "cuda" if torch.cuda.is_available() else "cpu"

rfdetr_model = RFDETRSegPreview(pretrain_weights=checkpoint, device=device)
rfdetr_model.model_config.resolution = 576
rfdetr_model.export()
```
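For what it's worth, the exported input shape can also be checked programmatically instead of via Netron (the ONNX path below is an assumption; point it at whatever file .export() actually produced):

```python
import onnx

# Hypothetical path: adjust to wherever .export() wrote the ONNX file
onnx_model = onnx.load("./output/inference_model.onnx")
dims = [d.dim_value or d.dim_param for d in onnx_model.graph.input[0].type.tensor_type.shape.dim]
print(dims)  # shows 432 for the spatial dims instead of the expected 576
```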
Environment
Windows 11 / Conda
Pip:
onnx==1.19.1
onnx_graphsurgeon==0.5.8
onnxruntime==1.15.1
onnxruntime-gpu==1.23.0
onnxsim==0.4.36
onnxslim==0.1.71
rfdetr==1.3.0
torch==2.8.0+cu129
Minimal Reproducible Example
Train RF-DETR Segmentation Preview with a non-default resolution, e.g. 576:
```python
import os, torch
from rfdetr import RFDETRSegPreview

DATASET_DIR = os.path.abspath("./dataset/coco_merged")
OUT_DIR = os.path.abspath("./runs/coco_merged/RFDETRSegPreview")
device = "cuda" if torch.cuda.is_available() else "cpu"
weights_path = r".\runs\coco_merged\RFDETRSegPreview0\checkpoint_best_regular.pth"

model = RFDETRSegPreview(pretrain_weights=weights_path)

EPOCHS = 1  # I ran 20
LR = 1e-4
RES = 576

model.train(
    dataset_dir=DATASET_DIR,
    output_dir=OUT_DIR,
    epochs=EPOCHS,
    batch_size=4,
    grad_accum_steps=1,
    lr=LR,
    resolution=RES,
    device=device,
    multi_scale=False,
    expanded_scales=False,
    run_test=False,
    eval=False,
    use_ema=False,
    early_stopping=False,
    gradient_checkpointing=False,
    tensorboard=False,
    wandb=False,
)
```
Then load that checkpoint and compare the resolution stored in the checkpoint's args with the loaded model's config:

```python
from rfdetr import RFDETRSegPreview
import torch

checkpoint = "./runs/coco_merged/RFDETRSegPreview/checkpoint_best_regular.pth"
device = "cuda" if torch.cuda.is_available() else "cpu"

rfdetr_model = RFDETRSegPreview(pretrain_weights=checkpoint, device=device)
config = rfdetr_model.model_config

obj = torch.load(checkpoint, weights_only=False)
args = obj.get("args", None)
print(f"Loaded args resolution: {args.resolution}, config resolution: {config.resolution}")
```
Output:
```
Using a different number of positional encodings than DINOv2, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Using patch size 12 instead of 14, which means we're not loading DINOv2 backbone weights. This is not a problem if finetuning a pretrained RF-DETR model.
Loading pretrain weights
Loaded args resolution: 576, config resolution: 432
```
→ contradiction
Additional
No response
Are you willing to submit a PR?
- [ ] Yes, I'd like to help by submitting a PR!
Hi @StefanNa3Shape, out of curiosity, what GPU do you use and how much VRAM would I need to reproduce this training?
Initialize the model as below instead, and try again.
```python
RFDETRSegPreview(pretrain_weights=checkpoint, device=device, resolution=576)
```
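Applied to the export snippet from the report, the full flow would look roughly like this (same checkpoint path as above; just the suggestion written out end to end):

```python
import torch
from rfdetr import RFDETRSegPreview

checkpoint = "./models/RFDETRSegPreviewV2.pth"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Pass resolution at construction so model_config matches the checkpoint's args.resolution
rfdetr_model = RFDETRSegPreview(pretrain_weights=checkpoint, device=device, resolution=576)
rfdetr_model.export()
```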
@Abdul-Mukit
So when I initialize the model with RFDETRSegPreview(pretrain_weights=checkpoint, device=device, resolution=576) for training, and also pass resolution=576 when loading the weights from the .pth file for export, the exported .onnx model does have a resolution of 576.
@jvmedeirosr I use an RTX 5090 with 32 GB VRAM. Since you asked: I also noticed that during training I can control how much VRAM is used via batch_size and grad_accum_steps, but when the training runs the test after each epoch, my GPU's VRAM hits 32 GB and sometimes my entire PC freezes...
@StefanNa3Shape I appreciate your answer. I'm using a 24 GB VRAM L40S on GCP, btw, with this training configuration:

```python
from rfdetr import RFDETRSegPreview

model = RFDETRSegPreview(pretrained=True, resolution=624, num_classes=1)

model.train(
    dataset_dir="./datasets",
    epochs=300,
    batch_size=2,
    grad_accum_steps=4,
    use_ema=True,
    multi_scale=False,
    expanded_scales=False,
    run_test=False,
    ema_decay=0.999,
    layer_norm=True,
    lr=1e-4,
    output_dir="./weights",
    early_stopping=True,
    tensorboard=True,
)
```

@Abdul-Mukit, I have a question I've been searching the docs for but can't find an answer to: what exactly does the resolution parameter do when it's passed to the RFDETRSegPreview class or to the train() method?
@jvmedeirosr not entirely sure.
Please see make_coco_transforms_square_div_64 in rfdetr/datasets/coco.py.
This guides the input resolution of the image during training and validation.
The other, more important thing is class RFDETRSegPreviewConfig(RFDETRBaseConfig) and projector_scale: List[Literal["P3", "P4", "P5"]] = ["P4"] in config.py. A projector scale of P4 means the feature map coming out of the transformer backbone has spatial size image_resolution / 2^4. So if the input image has shape (432, 432), the feature map used as input to the segmentation head has spatial shape (27, 27). The lower the spatial resolution of the feature map, the more spatial information we likely lose.
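As a quick sanity check on those numbers (plain arithmetic, not RF-DETR code):

```python
# P4 projector scale: feature map side = input side // 2**4
for res in (432, 576):
    print(f"{res}x{res} input -> P4 feature map {res // 16}x{res // 16}")
# 432x432 input -> P4 feature map 27x27
# 576x576 input -> P4 feature map 36x36
```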
You can see that in SegmentationHead.forward():
```python
def forward(self, spatial_features: torch.Tensor, query_features: list[torch.Tensor], image_size: tuple[int, int], skip_blocks: bool=False) -> list[torch.Tensor]:
    # spatial features: (B, C, H, W)
    # query features: [(B, N, C)] for each decoder layer
    # output: (B, N, H*r, W*r)
    target_size = (image_size[0] // self.downsample_ratio, image_size[1] // self.downsample_ratio)
    spatial_features = F.interpolate(spatial_features, size=target_size, mode='bilinear', align_corners=False)
```
The function upsamples to P2 (self.downsample_ratio is 4), so it is essentially trying to recover a higher resolution feature map (P2) from the lower resolution feature map (P4). A higher resolution feature map will likely result in more precise segmentation masks.
So by increasing resolution, you get better masks. But be careful of the time cost. RF-DETR produces 13,000 masks per input image during training. By increasing resolution, you are increasing the spatial size of all of the 13,000 masks. That is the reason VRAM usage saturates so easily. The model will become slower at higher resolution. That is my guess.
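To make the VRAM point a bit more concrete, here is a rough, purely illustrative back-of-the-envelope estimate (my own assumption: fp32 mask tensors at the P2 target size, using the ~13,000 masks mentioned above; real allocator behaviour will differ):

```python
def mask_tensor_gb(resolution: int, num_masks: int = 13_000, bytes_per_el: int = 4, downsample: int = 4) -> float:
    """Approximate size of the predicted mask tensor for one image."""
    side = resolution // downsample  # P2 target size used by SegmentationHead
    return num_masks * side * side * bytes_per_el / 1e9

print(mask_tensor_gb(432))  # ~0.61 GB per image
print(mask_tensor_gb(576))  # ~1.08 GB per image, roughly 1.8x more
```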