[DiT] Question regarding image resolutions on different tasks
First of all, thanks for your great work on document AI in general and on DiT specifically.
In the paper, it says:

> Since the image resolution for object detection tasks is much larger than classification, we limit the batch size to 16.
which confuses me. If I understand correctly, when using a pretrained version of DiT, one is ALWAYS limited to a 224x224 image resolution, since this is constrained by the fixed number of patch position embeddings (similar to how e.g. BERT-base simply can't go beyond 512 tokens due to its position embeddings). So regardless of the original size of the image, the input the model gets is always limited to this predefined 224x224.
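To make the constraint concrete, here is a toy calculation assuming a ViT-base style setup with 16x16 patches and 224x224 pre-training (the patch size and image size are my assumptions for illustration):

```python
# Why a fixed set of position embeddings pins the input resolution,
# analogous to BERT's 512-token limit. (Assumed values, for illustration.)
image_size, patch_size = 224, 16

num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196
print(num_patches)  # 196 patch position embeddings -> a 224x224 input is baked in
```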
If this reasoning is correct, then I cannot comprehend the logic behind resizing random crops of an image as described in the paper:

> Specifically, the input image is cropped with probability 0.5 to a random rectangular patch which is then resized again such that the shortest side is at least 480 and at most 800 pixels while the longest at most 1,333.
Any clarification on this would be very much appreciated, thanks in advance!
@NielsRogge sorry to bother you out of the blue, but since you've done so much of the groundwork to make this research easily available, I'm hopeful your insights can show me the flaws in my reasoning.
@mrvoh Do you know the reason now?
Hi,
Note that one can interpolate the pre-trained position embeddings to use the model at higher resolutions. This is probably what the authors did to fine-tune the model at a resolution different from the one used during pre-training.
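For reference, here is a minimal sketch of how such an interpolation can be done, following the bicubic-resize trick used for DeiT/BEiT-style checkpoints. The tensor layout (CLS token first), the `pos_embed` name, and the 16-pixel patch size are assumptions for illustration, not the authors' exact code:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize (1, 1 + N, D) position embeddings (CLS token first) to a new grid."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    old_grid = int(patch_pos.shape[1] ** 0.5)  # e.g. 14 for 224 / 16
    # Reshape the flat patch embeddings back into a 2D grid, resize, and flatten.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

# 224px pre-training (14x14 grid) -> e.g. 800px detection input (50x50 grid).
pretrained = torch.randn(1, 1 + 14 * 14, 768)  # dummy stand-in for a checkpoint
resized = interpolate_pos_embed(pretrained, new_grid=800 // 16)
print(resized.shape)  # torch.Size([1, 2501, 768])
```

With the position embeddings resized this way, the patch embedding layer itself needs no change (it is applied per patch), so the model can ingest the larger crops described in the paper.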