Regarding the Confusion about Ragged Tensors in the Documentation
In the "Fully Packed Layout (THD)" under Case 3 on this page, I noticed the following description:
`Q = aabb`
`dimension = [B = 2, H = 1, S = 8, D = 64]`
`stride = [S × H × D = 512, D = 64, H × D = 64, 1]`
What confuses me is that, despite using ragged_tensors, the dimensions still appear the same as they would be without ragged_tensors.
From my understanding, ragged_tensors should offer two key benefits:
- Improved memory access efficiency (due to more compact data arrangement).
- Memory savings (when sequences within a batch have varying lengths, ragged_tensors provide a more compact memory layout, as shown by the example `Q = aabbb` instead of `Q[b=0] = aa000000`, `Q[b=1] = bbb00000`).
However, in this case, the dimensions are still given as [B, H, S, D], which seems to suggest that the purpose of using ragged_tensors here is purely to improve memory access efficiency, without any memory savings. Could you kindly clarify whether my understanding is correct?
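To make the memory-savings point concrete, here is a minimal NumPy sketch of my understanding (my own illustration with assumed sequence lengths of 2 and 3; this is not code from the documentation):

```python
import numpy as np

# Assumed example: two sequences of real length 2 and 3, padded to S = 8.
B, H, S, D = 2, 1, 8, 64
seqlen = np.array([2, 3])
T = int(seqlen.sum())          # total number of real tokens = 5

padded_elems = B * S * H * D   # padded layout stores every sequence at length S
packed_elems = T * H * D       # packed (ragged) layout stores only the real tokens

print(padded_elems, packed_elems)   # 1024 vs. 320
```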
Hi @yhyang201, thanks for the question.
In the case of ragged offsets, where the sequences are packed together, the graph API dimensions of query, key, value, and output are indeed [B, S, H, D], but the underlying tensors are not allocated with that many elements.
Here, B denotes the number of sub-sequences, and S denotes the maximum sequence length. The query, key, value, and output tensors are each sized T * H * D elements, where T = sum(seqlen) and seqlen is an array of length B.
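Here is a minimal sketch of how those sizes relate, using plain NumPy rather than the graph API itself (the seqlen values and the cumulative-offset convention below are assumptions for illustration):

```python
import numpy as np

B, H, D = 2, 1, 64
S = 8                               # maximum sequence length, as reported to the graph
seqlen = np.array([2, 2])           # per-sub-sequence lengths, array of size B
T = int(seqlen.sum())               # total tokens across the batch

# The graph sees dim = [B, S, H, D], but the backing buffer only needs
# T * H * D elements -- the packed "aabb" layout.
q_buffer = np.empty(T * H * D, dtype=np.float16)

# One common ragged-offset convention (assumed here): a cumulative array of
# B + 1 entries giving each sub-sequence's starting element in the packed buffer.
ragged_offset = np.concatenate(([0], np.cumsum(seqlen))) * H * D

print(q_buffer.size)      # 256, not B * S * H * D = 1024
print(ragged_offset)      # [  0 128 256]
```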
Thanks