MLGRU
From the computational formula of MLGRU, it appears that parallelism across tokens is lost during the prefill phase, whereas Transformer++ is able to maintain parallelism across tokens (see the sketch after the questions). I have two questions:
- Does "latency" in Figure 4(d) mean first-token latency?
- And in Figure 4(d), does Transformer++ make use of token parallelism?
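
For context, here is a minimal sketch of an MLGRU-style prefill, assuming a simplified gate formulation with ordinary dense weights (the names `W_f`, `W_c`, `W_g` and the exact gating are illustrative, not the repo's actual parameters). The explicit loop shows the cross-token dependency I mean: `h_t` needs `h_{t-1}`, so a naive prefill cannot process all prompt tokens at once.

```python
import torch
import torch.nn.functional as F

def mlgru_prefill_sequential(x, W_f, W_c, W_g):
    """Sequential prefill over a prompt of T tokens.

    x: (T, d) token hidden states; W_f, W_c, W_g: (d, d) projections
    (illustrative shapes, simplified vs. the actual MLGRU).
    """
    T, d = x.shape
    h = x.new_zeros(d)
    outs = []
    for t in range(T):                   # sequential across tokens
        f = torch.sigmoid(x[t] @ W_f)    # forget gate
        c = F.silu(x[t] @ W_c)           # candidate state
        g = torch.sigmoid(x[t] @ W_g)    # output gate
        h = f * h + (1 - f) * c          # element-wise recurrence on h_{t-1}
        outs.append(g * h)
    return torch.stack(outs)
```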
@AACengineer Hi, Transformer++ also decodes in an autoregressive manner. During training, Transformer++ can be fully parallelized; however, we can also use a parallel scan to improve token parallelism. And because the linear-time GRU requires far fewer FLOPs than self-attention, our training efficiency can be much better. Also, the GRU does not need a KV cache, so the decoding space complexity is O(1).
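
To illustrate the parallel-scan point, here is a rough sketch (not the repo's actual kernel) of how a linear recurrence of the form `h_t = f_t * h_{t-1} + b_t` can be computed with an associative scan in O(log T) parallel rounds, plus the O(1)-state decode step that replaces a KV cache. Shapes and function names are illustrative.

```python
import torch

def linear_recurrence_scan(f, b):
    """Hillis-Steele inclusive scan for h_t = f_t * h_{t-1} + b_t, with h_0 = 0.

    f, b: (T, d). The pair (f, b) composes associatively:
        (f2, b2) o (f1, b1) = (f1 * f2, f2 * b1 + b2),
    so all T steps combine in O(log T) parallel rounds instead of a length-T loop.
    """
    T = f.shape[0]
    A, B = f.clone(), b.clone()
    stride = 1
    while stride < T:
        # combine each position with the composite ending `stride` steps earlier
        A_new, B_new = A.clone(), B.clone()
        A_new[stride:] = A[:-stride] * A[stride:]
        B_new[stride:] = A[stride:] * B[:-stride] + B[stride:]
        A, B = A_new, B_new
        stride *= 2
    return B  # equals h_t when h_0 = 0

def decode_step(h, f_t, c_t, g_t):
    """One autoregressive decode step: only the fixed-size state h is carried,
    so decoding memory is O(1) in sequence length (no KV cache)."""
    h = f_t * h + (1 - f_t) * c_t
    return g_t * h, h
```

In this formulation, `b_t` corresponds to the input term of the recurrence, e.g. `(1 - f_t) * c_t` in the sketch above.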