Mismatches between ViT-H/14 in AIM and ViT-H/14 in MAE
AIM-600M:
```python
def aim_600M(img_size: Union[int, Tuple[int, int]] = 224, **kwargs: Any) -> AIM:
    preprocessor, trunk, head = _aim(
        img_size=img_size,
        patch_size=14,
        embed_dim=1536,
        num_blocks=24,
        num_heads=12,
        **kwargs,
    )
    return AIM(preprocessor, trunk, head)
```
https://github.com/apple/ml-aim/blob/0b1dea9128f4734ae89252078e65aa102999407a/aim/torch/models.py#L176-L185
MAE ViT-H/14:
```python
def vit_huge_patch14(**kwargs):
    model = VisionTransformer(
        patch_size=14, embed_dim=1280, depth=32, num_heads=16, mlp_ratio=4, qkv_bias=True,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
    return model
```
https://github.com/facebookresearch/mae/blob/efb2a8062c206524e35e47d04501ed4f544c0ae8/models_vit.py#L70-L74
The two models have very different embedding dimensions (1536 vs. 1280), depths (24 vs. 32 blocks), and numbers of heads (12 vs. 16), so their weights are not interchangeable. However, in Tab. 6 of the paper, these two works share the same architecture in the "Arch." column. Are the two architectures actually different, as the code suggests? If so, it should probably be clarified in the paper, e.g. by reporting each model's parameter count.
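For context, here is a minimal back-of-the-envelope sketch of the trunk sizes (`approx_trunk_params` is my own helper; it ignores the patch embedding, positional embeddings, and head, and assumes `mlp_ratio=4` for AIM-600M, which I'm inferring since the `_aim` call above doesn't show it):

```python
def approx_trunk_params(embed_dim: int, depth: int, mlp_ratio: int = 4) -> int:
    """Rough pre-norm ViT trunk size: per block, ~4*d^2 weights for
    attention (QKV + output projection) plus ~2*mlp_ratio*d^2 for the MLP."""
    return depth * (4 + 2 * mlp_ratio) * embed_dim**2

# mlp_ratio=4 for AIM-600M is an assumption; MAE's config sets it explicitly.
print(f"AIM-600M trunk : ~{approx_trunk_params(1536, 24) / 1e6:.0f}M")  # ~679M
print(f"MAE ViT-H trunk: ~{approx_trunk_params(1280, 32) / 1e6:.0f}M")  # ~629M
```

If this estimate is roughly right, both trunks land near 0.6B parameters despite the different shapes, which makes it all the more useful for the paper to state the counts explicitly.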