FasterNet
Does the choice between GELU and ReLU have a significant impact on the performance of the T0 model?
I noticed that you use GELU in the smaller models (T0 and T1) but ReLU in the larger ones (T2 and up). Is this intentional, or an oversight?
Hi, as stated in the ablation study of the paper, we empirically found that GELU suits the FasterNet-T0/T1 models better than ReLU, while the opposite holds for FasterNet-T2/S/M/L. We conjecture that GELU strengthens FasterNet-T0/T1 through its higher non-linearity, a benefit that fades away for the larger FasterNet variants.
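The difference the answer refers to can be made concrete: ReLU zeroes out all negative inputs, while GELU passes a small, smoothly varying negative response, which is the extra non-linearity mentioned above. Below is a minimal, dependency-free sketch of both functions plus a hypothetical per-variant lookup table (the `ACT_BY_VARIANT` dict and `activation_for` helper are illustrative names, not part of the FasterNet codebase) reflecting the empirical finding: GELU for T0/T1, ReLU for T2/S/M/L.

```python
import math

# Hypothetical variant -> activation mapping, following the ablation result
# described above (GELU for the two smallest variants, ReLU for the rest).
ACT_BY_VARIANT = {
    "T0": "gelu", "T1": "gelu",
    "T2": "relu", "S": "relu", "M": "relu", "L": "relu",
}

def relu(x: float) -> float:
    # ReLU: hard cutoff, exactly zero for all negative inputs.
    return max(0.0, x)

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    # Unlike ReLU, it is smooth and slightly negative for x < 0.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def activation_for(variant: str):
    """Return the activation function used by the given variant."""
    return gelu if ACT_BY_VARIANT[variant] == "gelu" else relu
```

For example, at `x = -1.0`, ReLU returns exactly `0.0`, whereas GELU returns a small negative value (about `-0.159`), so gradients still flow for negative pre-activations. In PyTorch this choice would correspond to swapping `nn.GELU()` for `nn.ReLU()` when building each variant.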