Results 4 issues of Aran Komatsuzaki

Thank you very much for open-sourcing GShard! I'm currently using MoE from Mesh TensorFlow. The design of MoE used in MTF is equivalent to that of GShard, iiuc. According to...

As I mentioned before, I'm working on applying AlphaZero to text generation using a decoder-only Transformer instead of a CNN. My implementation is nearly finished, but I haven't tested to see its...

As in the original T5 paper, we train T5 (and GPT for reference) with a varying number of epochs (e.g. 1, 8, 64) on C4 and see at which...

In the final phase, where you choose the best architecture based on reward, the rewards for ptb and cifar10 are set to c/ppl + (entropy term) and accuracy,...