Upsample size mismatch in segmentation models
Describe the bug
Depending on the input image size, feature maps upsampled with nn.Upsample don't always match the size of the skip connection they are concatenated with. This is a known issue; some reference links:
- https://github.com/pytorch/pytorch/issues/71877
- https://github.com/pytorch/pytorch/issues/7732
Replacing nn.Upsample with torch.nn.functional.interpolate seems to be the recommended solution.
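For illustration, something along these lines (a rough sketch with a made-up SkipFusion module, not the actual PP-LiteSeg decoder): interpolate to the skip tensor's spatial size instead of using a fixed scale factor.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusion(nn.Module):
    """Hypothetical decoder block: upsample x and concatenate it with a skip connection."""

    def __init__(self, in_channels: int, skip_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels + skip_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        # Instead of nn.Upsample(scale_factor=2), resize to the exact spatial
        # size of the skip connection so the concatenation always matches.
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))

# Shapes taken from the failing case below: 33 px deep map vs. 65 px skip.
deep = torch.randn(1, 128, 33, 33)
skip = torch.randn(1, 64, 65, 65)
print(SkipFusion(128, 64, 64)(deep, skip).shape)  # torch.Size([1, 64, 65, 65])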
To Reproduce
Here's a snippet using PP-LiteSeg. The dataset is Cityscapes, but that isn't important; the image size is the deciding factor. I imagine the issue affects all models that use nn.Upsample and concatenate with skip connections:
from super_gradients.training import models, dataloaders, Trainer
from super_gradients.common.object_names import Models
from super_gradients.training.metrics import IoU

trainer = Trainer(experiment_name="eval-pp-liteseg-b75")
val_loader = dataloaders.cityscapes_stdc_seg75_val(
    dataset_params={
        "transforms": [
            {
                "SegRescale": {
                    "long_size": 1025
                }
            }
        ]
    },
    dataloader_params={"batch_size": 1},
)
model = models.get(
    Models.PP_LITE_B_SEG75,
    pretrained_weights="cityscapes",
)
metric = IoU(num_classes=20, ignore_index=19)
miou = trainer.test(
    model=model,
    test_loader=val_loader,
    test_metrics_list=[metric],
    metrics_progress_verbose=False
)[0].cpu().item()
print(f"mIoU: {miou}")
Results in an error:
File ".../src/super_gradients/training/models/segmentation_models/ppliteseg.py", line 52, in forward
atten = torch.cat([*self._avg_max_spatial_reduce(x, use_concat=False), *self._avg_max_spatial_reduce(skip, use_concat=False)], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 66 but got size 65 for tensor number 2 in the list.
Expected behavior
Fully convolutional segmentation models should work for all input image sizes.
Environment:
- Ubuntu
- super-gradients v3.0.7
- PyTorch 1.11
Hi! Thanks for raising this issue.
TL;DR: One cannot feed an arbitrarily sized image to the model.
I believe the root cause of the problem is that the input image size is not evenly divisible by the maximum stride of the backbone (32). In that case the stride-2 stages round the feature map sizes, so consecutive feature maps are no longer related by an exact factor of two and a fixed x2 upsample cannot match the skip connection.
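To make that concrete with the numbers from the repro above (assuming the usual kernel-3, stride-2, padding-1 downsampling convolutions):

import torch
import torch.nn as nn

# A 1025 px long side shrinks through the stride-2 stages as
# 1025 -> 513 -> 257 -> 129 -> 65 -> 33, i.e. the sizes get rounded.
skip = torch.randn(1, 64, 65, 65)    # stride-16 feature map
deep = torch.randn(1, 128, 33, 33)   # stride-32 feature map

up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
print(up(deep).shape[-2:])           # torch.Size([66, 66]), one pixel larger than the 65 px skip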
Indeed, explicitly specifying the output size for the upsample operations could patch this. However, this would only work for interpolation-based upsampling and not for nn.PixelShuffle or nn.ConvTranspose2d upsampling.
We will definitely look into it, but for now I suggest preprocessing the input images so that their size is divisible by 32.
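A minimal sketch of that workaround (the pad_to_multiple helper below is just an illustration, not part of super-gradients): pad the image up to the next multiple of 32, run the model, and crop the prediction back to the original size.

import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, multiple: int = 32):
    """Zero-pad an NCHW tensor on the right/bottom so H and W are divisible by multiple."""
    h, w = x.shape[-2:]
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    return F.pad(x, (0, pad_w, 0, pad_h)), (h, w)

image = torch.randn(1, 3, 1025, 2049)
padded, (h, w) = pad_to_multiple(image)
print(padded.shape)              # torch.Size([1, 3, 1056, 2080])
# logits = model(padded)
# logits = logits[..., :h, :w]   # crop the prediction back to 1025 x 2049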
At least for nn.ConvTranspose2d there's output_padding to address this issue, see: https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html#convtranspose2d
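For reference, a tiny sketch with generic layers (not super-gradients code) showing how output_padding selects the target size:

import torch
import torch.nn as nn

deep = torch.randn(1, 128, 33, 33)   # stride-32 map; the matching skip is 65 px

up_odd = nn.ConvTranspose2d(128, 64, kernel_size=3, stride=2, padding=1, output_padding=0)
up_even = nn.ConvTranspose2d(128, 64, kernel_size=3, stride=2, padding=1, output_padding=1)

print(up_odd(deep).shape[-2:])   # torch.Size([65, 65]), matches the skip connection
print(up_even(deep).shape[-2:])  # torch.Size([66, 66]), plain x2 upsampling

Note that output_padding is fixed when the layer is constructed, so it only helps when the parity of the target size is known in advance.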
Looks like there's no way around it for nn.PixelShuffle though; maybe that would be a good feature request for PyTorch.