Ze Liu
Results: 32 comments of Ze Liu
Hi @Luodian, yes, you need to set `gate_noise>0` for load_importance_loss. The reasons are given in Appendix A ("Load-Balancing Loss") of the original paper (https://arxiv.org/pdf/1701.06538.pdf).
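To make the dependence on noise concrete, here is a minimal sketch of the smooth load estimator described in that appendix: for each expert, the probability that it stays in the top-k is a normal CDF whose argument is divided by the noise standard deviation. The function names (`normal_cdf`, `prob_in_top_k`) and the plain-list inputs are hypothetical, chosen for illustration only. This is not the repository's implementation, and how `gate_noise` maps onto `noise_std` there is an assumption not covered here.

```python
import math


def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_in_top_k(clean_logits, noisy_logits, noise_std, k):
    """Per-expert probability of landing in the top-k under fresh noise.

    Hypothetical helper: requires k < number of experts.
    """
    probs = []
    for i, clean in enumerate(clean_logits):
        # k-th largest noisy logit among all experts except i
        others = sorted(
            (v for j, v in enumerate(noisy_logits) if j != i), reverse=True
        )
        threshold = others[k - 1]
        # Dividing by noise_std[i] is where zero noise breaks the estimator.
        probs.append(normal_cdf((clean - threshold) / noise_std[i]))
    return probs


clean = [1.0, 0.5, -0.2, 0.1]
probs = prob_in_top_k(clean, clean, noise_std=[1.0] * 4, k=2)
# The estimated load of an expert is the sum of these probabilities over a batch.
```

With `noise_std` set to zero, the division fails (a `ZeroDivisionError` in this plain-Python sketch), which mirrors why the load term is only well defined when the gate is noisy.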
Hi @DavidZhang88, this is not a bug. By default, `qk_scale` is None, and `self.scale` is set to `head_dim ** -0.5`, which is consistent with "Attention is all you need". But...
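For reference, the default described in this reply can be sketched as below. `attention_scale` is a hypothetical helper written for illustration, not a function from the repository; it only encodes the stated behaviour that a `qk_scale` of None falls back to `head_dim ** -0.5`, the 1/sqrt(d_k) factor of scaled dot-product attention.

```python
def attention_scale(head_dim, qk_scale=None):
    """Return the factor applied to q @ k^T before the softmax.

    Hypothetical helper: falls back to 1/sqrt(head_dim) when qk_scale is None.
    """
    if qk_scale is not None:
        return qk_scale
    return head_dim ** -0.5


default_scale = attention_scale(64)        # 64 ** -0.5
custom_scale = attention_scale(64, 0.5)    # explicit override is used as given
```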