saint
saint copied to clipboard
[def intersample() question]
When reshaping queries, keys, and values in the intersample() function, shouldn’t they be changed to (1, h, b, n*d)?
The code has a structure in which 8 heads per batch (records) self-attention, but the contents of the paper are self-attention by batches per head.
Please reply!