Regarding the Confusion about Ragged Tensors in the Documentation
In the "Fully Packed Layout (THD)" under Case 3 on this page, I noticed the following description:
`Q = aabb`
`dimension = [B = 2, H = 1, S = 8, D = 64]`
`stride = [S × H × D = 512, D = 64, H × D = 64, 1]`
What confuses me is that, despite using ragged_tensors, the dimensions still appear the same as they would be without ragged_tensors.
From my understanding, ragged_tensors should offer two key benefits:
- Improved memory access efficiency (due to more compact data arrangement).
- Memory savings (when sequences within a batch have varying lengths, ragged_tensors provide a more compact memory layout, as shown by the example `Q = aabbb` instead of `Q[b=0] = aa000000`, `Q[b=1] = bbb00000`).
However, in this case, the dimensions are still given as [B, H, S, D], which seems to suggest that the purpose of using ragged_tensors here is purely to improve memory access efficiency, without any memory savings. Could you kindly clarify whether my understanding is correct?
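To make the memory-savings point concrete, here is a minimal NumPy sketch of my understanding (my own illustration with assumed sequence lengths of 2 and 3; this is not code from the documentation):

```python
import numpy as np

# Assumed example: two sequences of real length 2 and 3, padded to S = 8.
B, H, S, D = 2, 1, 8, 64
seqlen = np.array([2, 3])
T = int(seqlen.sum())          # total number of real tokens = 5

padded_elems = B * S * H * D   # padded layout stores every sequence at length S
packed_elems = T * H * D       # packed (ragged) layout stores only the real tokens

print(padded_elems, packed_elems)   # 1024 vs. 320
```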
Hi @yhyang201, thanks for the question.
In the case of ragged offsets, where the sequences are packed together, the graph API dimensions of query, key, value, and output are indeed [B, S, H, D], but the underlying tensors are not allocated with that many elements.
Here, B denotes the number of sub-sequences, and S denotes the maximum sequence length. The query, key, value, and output tensors are each sized T * H * D elements, where T = sum(seqlen) and seqlen is an array of length B.
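Here is a minimal sketch of how those sizes relate, using plain NumPy rather than the graph API itself (the seqlen values and the cumulative-offset convention below are assumptions for illustration):

```python
import numpy as np

B, H, D = 2, 1, 64
S = 8                               # maximum sequence length, as reported to the graph
seqlen = np.array([2, 2])           # per-sub-sequence lengths, array of size B
T = int(seqlen.sum())               # total tokens across the batch

# The graph sees dim = [B, S, H, D], but the backing buffer only needs
# T * H * D elements -- the packed "aabb" layout.
q_buffer = np.empty(T * H * D, dtype=np.float16)

# One common ragged-offset convention (assumed here): a cumulative array of
# B + 1 entries giving each sub-sequence's starting element in the packed buffer.
ragged_offset = np.concatenate(([0], np.cumsum(seqlen))) * H * D

print(q_buffer.size)      # 256, not B * S * H * D = 1024
print(ragged_offset)      # [  0 128 256]
```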
Thanks