Guidance on Improving Segmentation Model Performance Across 4K and 1K Resolutions
Hi @ZhengPeng7,
We have trained a car background removal (segmentation) model on a dataset of 4K images (4032x3024). The model performs well on the 4K test set and also works reasonably well on 1K images (1365x1024) in most cases. However, we have observed that for certain images the 4K version produces a clean, accurate background removal, while the same image resized to 1K yields noticeably less precise results.
To illustrate this, I have attached four files:
• small_car1.png: 1K image (1365x1024)
• small_car1_mask.png: Mask for the 1K image
• large_car1.png: 4K image (4032x3024)
• large_car1_mask.png: Mask for the 4K image
As you can see, there are subtle but significant differences, especially in the finer details of the car.
Questions:
1. Could you share insights on why the model might produce less accurate results when the input image is downscaled to 1K, even though resizing augmentation was part of training?
2. From an approach perspective, should we take our current model (trained on 4K images) and fine-tune it on the same dataset downscaled to 1K? Would this improve accuracy for 1K inputs?
3. Alternatively, is there a better method to ensure consistent performance across both 4K and 1K resolutions?
Your advice on this would be greatly appreciated.
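For reference, our inference path looks roughly like this (a sketch following the standard BiRefNet usage from the README/model card; only the input file differs between the 4K and 1K runs):

```python
import torch
import torchvision.transforms.functional as TF
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForImageSegmentation

# Load BiRefNet as described in the README / HF model card.
model = AutoModelForImageSegmentation.from_pretrained(
    "ZhengPeng7/BiRefNet", trust_remote_code=True
).eval()

# Every input is resized to the training resolution (1024x1024)
# and normalized with ImageNet statistics.
preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def predict_mask(path):
    image = Image.open(path).convert("RGB")
    x = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        # The last element of the output list is the final prediction.
        pred = model(x)[-1].sigmoid().squeeze(0).cpu()
    # Resize the mask back to the source resolution for comparison.
    return TF.resize(pred, list(image.size[::-1]))

mask_4k = predict_mask("large_car1.png")  # clean, accurate edges
mask_1k = predict_mask("small_car1.png")  # noticeably less precise
```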
Larger images tend to let the model focus more on local regions, e.g., treating different parts of the car as separate parts; that is why the glass around the driver's head comes out transparent in the larger input.
- And which model are you using? If it's the default BiRefNet, 1024x1024 should be the preferred resolution; if you are inferring at other resolutions, that effect is expected. You can also try BiRefNet_dynamic, which was trained with dynamic input resolutions.
- Yeah, sure. Fine-tuning would improve things a lot, since your cases are limited to certain scenarios. In most cases, 1024x1024 can already handle them well, but if possible, 2048x2048 should be even better for your 4K images. You can use the weights I provide for fine-tuning, following my video tutorial (a rough sketch of the loop is below this list).
- BiRefNet_dynamic has already done this, with resolutions from 256x256 to (2048+256)x(2048+256) during training (the resize idea is sketched below the list). You can take a closer look at the README there and try it in my online HF demo.
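For the fine-tuning in point 2, the loop is roughly as below. This is only an illustrative sketch: plain BCE stands in for the full combined loss used in the repo, the stand-in tensors replace your real DataLoader, and `bb_pretrained=False` just skips re-downloading backbone weights before the checkpoint is loaded.

```python
import torch
import torch.nn.functional as F
from models.birefnet import BiRefNet  # model class from this repo

# Load the released weights as the starting point for fine-tuning.
model = BiRefNet(bb_pretrained=False)
state = torch.load("BiRefNet-general-epoch_244.pth", map_location="cpu")
model.load_state_dict(state)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Stand-in batch; in practice, iterate a DataLoader of (image, mask)
# pairs resized to the fine-tuning resolution (1024x1024 or 2048x2048).
images = torch.randn(2, 3, 1024, 1024)
masks = torch.randint(0, 2, (2, 1, 1024, 1024)).float()

preds = model(images)[-1]  # final-stage logits at input resolution
loss = F.binary_cross_entropy_with_logits(preds, masks)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```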
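And the dynamic-resolution idea in training is essentially random resizing of each batch, something like this sketch (illustrative only; the exact sampling in BiRefNet_dynamic may differ, and sizes are kept multiples of 32 for the backbone):

```python
import random
import torch.nn.functional as F

def random_resize(images, masks, lo=256, hi=2048 + 256, step=32):
    """Resize a batch to a random square size in [lo, hi] that is a multiple of 32."""
    side = random.randrange(lo, hi + 1, step)
    images = F.interpolate(images, size=(side, side),
                           mode="bilinear", align_corners=False)
    # Nearest-neighbor keeps mask values hard.
    masks = F.interpolate(masks, size=(side, side), mode="nearest")
    return images, masks
```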
- We used BiRefNet-general-epoch_244.pth as the base model to fine-tune our 4K model at the default BiRefNet resolution of 1024x1024.
- We fine-tuned BiRefNet-general-epoch_244.pth for an additional 250 epochs using 23,000 4K images, resulting in a new fine-tuned model, BiRefNet-general-epoch_494.pth.
We chose 4K images because we expected that production inference would be performed on 4K images. However, we have now learned that production inference will actually be performed on 1K images (which we were not aware of earlier), and those 1K inputs show the issues identified above.
Considering this, as a solution, can I take my BiRefNet-general-epoch_494.pth model and fine-tune it further using the same 4K dataset downscaled to 1K? Would this improve accuracy for 1K inputs?
- Okay, thanks.
Note: At this stage of the project, we are not in a position to switch our approach to BiRefNet_dynamic.
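For reference, we would prepare the 1K copy of the dataset roughly like this (a sketch; the directory layout and filename convention are placeholders, matching the attachments above):

```python
from pathlib import Path
from PIL import Image

SRC, DST = Path("dataset_4k"), Path("dataset_1k")
TARGET = (1365, 1024)  # (width, height)
DST.mkdir(parents=True, exist_ok=True)

for path in SRC.glob("*.png"):
    img = Image.open(path)
    # LANCZOS for images; NEAREST for masks so labels stay binary.
    resample = Image.NEAREST if path.stem.endswith("_mask") else Image.LANCZOS
    img.resize(TARGET, resample).save(DST / path.name)
```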
Yes, certainly. I also started from BiRefNet-general when fine-tuning to obtain BiRefNet_dynamic. It's okay to load a pre-trained model trained at a certain resolution and fine-tune it at another.
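As a quick illustrative check that the same weights run at different input sizes (a sketch; I assume inputs divisible by 32 here, since nothing in the checkpoint ties the model to one resolution):

```python
import torch
from models.birefnet import BiRefNet

model = BiRefNet(bb_pretrained=False)
model.load_state_dict(
    torch.load("BiRefNet-general-epoch_494.pth", map_location="cpu")
)
model.eval()

with torch.no_grad():
    for side in (1024, 2048):
        pred = model(torch.randn(1, 3, side, side))[-1]
        print(side, pred.shape)  # the prediction follows the input resolution
```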
Hello @kparikh-tenup, are you able to achieve complete background removal, i.e., through the windows of a car? If so, would you mind sharing insights on your fine-tuning process? If you'd prefer, I can reach out via email so as not to clutter this thread (mine is [email protected]).