ml-fastvit
[Bug] Include last attention layer in feature output
The last output index should be 7 instead of 6, at least for the SA12 architecture. On SA12, returning index 6 as the final feature skips the last attention layer, while index 7 includes it. The timm implementation does include the final attention layer in its feature output. I have trained both variants for segmentation on ADE20K using mmsegmentation with this configuration:
```python
model = dict(
    type='EncoderDecoder',
    data_preprocessor=data_preprocessor,
    backbone=dict(
        type='FastViTSA12',
        pretrained=True,
    ),
    neck=dict(
        type='FPN',
        in_channels=[64, 128, 256, 512],
        out_channels=256,
        num_outs=4,
    ),
    decode_head=dict(
        type='FPNHead',
        in_channels=[256, 256, 256, 256],
        in_index=[0, 1, 2, 3],
        feature_strides=[4, 8, 16, 32],
        channels=128,
        dropout_ratio=0.1,
        num_classes=1,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss',
            use_sigmoid=False,
            loss_weight=1.0,
        ),
    ),
)
```
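To make the indexing concrete, here is a minimal pure-Python sketch (illustrative only, not the actual ml-fastvit code) of how an `out_indices` tuple controls which stage outputs a backbone returns; the 8-stage layout and the `extract_features` helper are assumptions standing in for SA12, where the final self-attention stage sits at index 7:

```python
def extract_features(blocks, out_indices, x):
    """Run x through each block in order, collecting an (index, output)
    pair for every index listed in out_indices."""
    outs = []
    for i, block in enumerate(blocks):
        x = block(x)
        if i in out_indices:
            outs.append((i, x))
    return outs

# Eight toy stages standing in for SA12's stage sequence; index 7
# plays the role of the final attention stage.
blocks = [(lambda v: v + 1) for _ in range(8)]

# Ending out_indices at 6 never returns the output of stage 7:
feats_old = extract_features(blocks, (0, 2, 4, 6), 0)

# Ending at 7 includes the final (attention) stage in the features:
feats_new = extract_features(blocks, (0, 2, 4, 7), 0)
```

With the first tuple the deepest returned feature comes from stage 6, so everything computed by stage 7 is silently discarded; the second tuple is the behavior this issue is asking for.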
Some of the differences are:
| Model | Parameters | ADE20k Val mIoU |
|---|---|---|
| Apple FastViT SA12 FPN | 8.3M | 30 |
| Timm FastViT SA12 FPN | 14.6M | 39 |
With the final attention layer included, both the parameter count and the performance line up much more closely with the paper's reported numbers.