Questions about the paper
First, great work! I read the paper and had a few questions.
- On p. 5, the paper says that the minimal sequence length is `s = 6c`, but where does this 6 come from? Is it related to the `6bch` memory for the blocks?
- About the memory requirement: if I understand correctly, the total memory for the 6 blocks might be `12bch` bytes (instead of `6bch`), because each element is bfloat16? (My rough arithmetic for this and the bandwidth question is sketched after this list.)
- Possibly, the interconnect bandwidth for TPUs might be wrong? According to https://cloud.google.com/blog/products/ai-machine-learning/introducing-cloud-tpu-v5p-and-ai-hypercomputer?hl=en (the table), the ICI bandwidth per chip is 2,400 Gbps. My understanding is that this is the total over 6 links (forming the 3D torus), so each link is 400 Gbps, i.e. 50 GB/s. Let me know if this interpretation is wrong.