FasterNet
Does the choice between GELU and ReLU have a significant impact on the performance of the T0 model?
I noticed that you use GELU in the smaller models (T0 and T1) but ReLU in the larger ones (T2 and up). Is this intentional, or an oversight?
Hi, as stated in the ablation study of the paper, we empirically found that GELU suits the FasterNet-T0/T1 models better than ReLU, while the opposite holds for FasterNet-T2/S/M/L. We conjecture that GELU strengthens FasterNet-T0/T1 through its higher non-linearity, a benefit that fades away for the larger FasterNet variants.
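The difference the answer refers to can be made concrete: ReLU zeroes out all negative inputs, while GELU passes a small, smoothly varying negative response, which is the extra non-linearity mentioned above. Below is a minimal, dependency-free sketch of both functions plus a hypothetical per-variant lookup table (the `ACT_BY_VARIANT` dict and `activation_for` helper are illustrative names, not part of the FasterNet codebase) reflecting the empirical finding: GELU for T0/T1, ReLU for T2/S/M/L.

```python
import math

# Hypothetical variant -> activation mapping, following the ablation result
# described above (GELU for the two smallest variants, ReLU for the rest).
ACT_BY_VARIANT = {
    "T0": "gelu", "T1": "gelu",
    "T2": "relu", "S": "relu", "M": "relu", "L": "relu",
}

def relu(x: float) -> float:
    # ReLU: hard cutoff, exactly zero for all negative inputs.
    return max(0.0, x)

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    # Unlike ReLU, it is smooth and slightly negative for x < 0.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def activation_for(variant: str):
    """Return the activation function used by the given variant."""
    return gelu if ACT_BY_VARIANT[variant] == "gelu" else relu
```

For example, at `x = -1.0`, ReLU returns exactly `0.0`, whereas GELU returns a small negative value (about `-0.159`), so gradients still flow for negative pre-activations. In PyTorch this choice would correspond to swapping `nn.GELU()` for `nn.ReLU()` when building each variant.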