PosEmbed is different for each block - shouldn't it be the same!?
Reading https://github.com/google-research/nested-transformer/blob/main/models/nest_net.py#L89 and https://github.com/google-research/nested-transformer/blob/main/libml/self_attention.py#L225, it's clear that the PositionEmbedding is applied over three axes, not two.
This is further corroborated by the PyTorch port: https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/nest.py#L204
So the embedding has shape (blocks, seqlen, d) rather than (1, seqlen, d), which means each block (= part of the image) gets its own set of positional embeddings.
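For reference, here is a minimal PyTorch-style sketch of what I understand the current behaviour to be (class and argument names are mine, not taken from either repo):

```python
import torch
import torch.nn as nn


class PerBlockPosEmbed(nn.Module):
    """Sketch of the current behaviour: a separate learned table per block.

    num_blocks = image blocks at this hierarchy level,
    seqlen     = tokens per block,
    dim        = embedding width.
    """

    def __init__(self, num_blocks: int, seqlen: int, dim: int):
        super().__init__()
        # One distinct (seqlen, dim) table for every block.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_blocks, seqlen, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_blocks, seqlen, dim)
        return x + self.pos_embed
```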
Is this intentional? It seems to run counter to the spirit of the paper?!
With (1, seqlen, d) there would be more of an inductive bias, right? With (blocks, seqlen, d) the earlier stages can "cheat" and do different things in different blocks (= parts of the image)?
And more importantly, the separate blocks don't reinforce each other's learning of the positional embedding; wouldn't you expect even faster convergence (arguably the paper's main achievement) with (1, seqlen, d)?
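For concreteness, the shared variant I have in mind would look roughly like this (again just a sketch with made-up names): the block axis has size 1, so every block adds the same table and all blocks contribute gradients to it.

```python
import torch
import torch.nn as nn


class SharedPosEmbed(nn.Module):
    """Sketch of the proposed alternative: one table shared by all blocks."""

    def __init__(self, seqlen: int, dim: int):
        super().__init__()
        # Single (seqlen, dim) table; block axis is 1 and broadcasts.
        self.pos_embed = nn.Parameter(torch.zeros(1, 1, seqlen, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_blocks, seqlen, dim); broadcasting over the block axis.
        return x + self.pos_embed
```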
Thanks!
(This only matters at the first two levels of the hierarchy, where #blocks > 1.)