Weiqian Chen

Results: 3 issues by Weiqian Chen

The softmax scaling should use the per-head dimension, i.e. it should be `attention = torch.softmax(energy / (self.head_dim ** (1 / 2)), dim=3)`
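A minimal sketch of the point above: in scaled dot-product attention, the `QK^T` scores should be divided by the square root of the per-head dimension (`head_dim`), not of the full embedding size. The tensor shapes and the names `energy` and `head_dim` follow the snippet in the issue; the concrete sizes here are illustrative, not from the repo.

```python
import torch

# Illustrative shapes: (batch, heads, query_len, key_len)
N, heads, query_len, key_len, head_dim = 2, 4, 5, 5, 8
energy = torch.randn(N, heads, query_len, key_len)  # raw QK^T scores

# Correct scaling: divide by sqrt(head_dim), the per-head dimension,
# so the variance of the scores stays roughly constant per head.
attention = torch.softmax(energy / (head_dim ** (1 / 2)), dim=3)

# Softmax over dim=3 makes each row of key weights sum to 1.
print(torch.allclose(attention.sum(dim=3), torch.ones(N, heads, query_len)))
```

Dividing by `embed_size ** 0.5` instead would over-shrink the scores (since `embed_size = heads * head_dim`), flattening the softmax distribution.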

I wonder why the Retrieval accuracy is almost 20% higher than that of the official JAX/FLAX implementation. As the paper says, "While we achieve consistent results reported in (Tay et al. 2020)...

Dear author, your work is excellent! I'm very interested in your training scripts and would like to run some experiments with them. Please consider releasing the full code.