
[Bug] Include last attention layer in feature output

Open · dillonalaird opened this issue · 0 comments

The last out index should be 7 instead of 6, at least for the SA12 architecture. On SA12, returning index 6 as the final feature skips the last attention stage, while index 7 includes it. The timm implementation does include the final attention stage in its feature output. I have trained both models for segmentation on ADE20k using mmsegmentation with this configuration:

model = dict(
    type='EncoderDecoder',
    data_preprocessor=data_preprocessor,  # defined earlier in the full config
    backbone=dict(
        type='FastViTSA12',
        pretrained=True,
    ),
    neck=dict(
        type='FPN',
        # Channel widths of the four backbone feature maps (strides 4-32)
        in_channels=[64, 128, 256, 512],
        out_channels=256,
        num_outs=4,
    ),
    decode_head=dict(
        type='FPNHead',
        in_channels=[256, 256, 256, 256],
        in_index=[0, 1, 2, 3],
        feature_strides=[4, 8, 16, 32],
        channels=128,
        dropout_ratio=0.1,
        num_classes=1,
        norm_cfg=norm_cfg,  # defined earlier in the full config
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss',
            use_sigmoid=False,
            loss_weight=1.0,
        ),
    ),
)
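
For reference, a quick way to see what the out indices map to is to enumerate the backbone's stage list. This is a minimal sketch; the `fastvit_sa12` constructor, `fork_feat` flag, and `out_indices` attribute are my understanding of the ml-fastvit repo's API and may differ slightly:

```python
from models import fastvit_sa12  # from the ml-fastvit repo

# Build the backbone in feature-extraction mode and list the modules that
# the out indices select from. On SA12 the attention blocks sit in the
# final entries, so a last out index of 6 drops them while 7 keeps them.
model = fastvit_sa12(fork_feat=True)
for i, stage in enumerate(model.network):
    print(i, type(stage).__name__)
print("out_indices:", model.out_indices)
```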

Some of the differences are:

| Model | Parameters | ADE20k Val mIoU |
| --- | --- | --- |
| Apple FastViT SA12 + FPN | 8.3M | 30 |
| timm FastViT SA12 + FPN | 14.6M | 39 |

With the final attention stage included, both the parameter count and the performance line up much more closely with the paper's reported numbers.
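
As a sanity check on the parameter gap, the backbone size can be counted directly from timm. A rough sketch, assuming a recent timm release with the FastViT models registered:

```python
import timm

# Count parameters of the timm FastViT-SA12 backbone in feature mode,
# which includes the final attention stage in its outputs. Comparing this
# against the ml-fastvit backbone makes the missing-stage gap visible.
model = timm.create_model("fastvit_sa12", features_only=True)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```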

dillonalaird — Nov 30, 2023