hiersumm
Question about residual connections
Hi
Thanks for sharing the code. After implementing Hierarchical Transformers, I found there is a difference in the residual connection between figure 2 and the code.
Here, the input to the PositionwiseFeedForward is already `inputs + block_vec`, and this matches figure 2 in the paper.
However, why is `output + x` performed here?
It looks like two residual connections.
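To make the question concrete, here is a minimal sketch of the data flow as I read it. The function names (`toy_ffn`, `feed_forward_with_residual`, `layer`) and the list-based arithmetic are my own placeholders, not the repository's code, and layer normalization and dropout are left out.

```python
# Hypothetical stand-ins for illustration only: plain lists instead of tensors.

def add(a, b):
    """Element-wise sum of two equal-length lists."""
    return [x + y for x, y in zip(a, b)]

def toy_ffn(x):
    # Placeholder for the two linear layers; here it just halves each element.
    return [0.5 * v for v in x]

def feed_forward_with_residual(x):
    # The pattern in question: the feed-forward module adds its input back.
    output = toy_ffn(x)
    return add(output, x)  # `output + x`

def layer(inputs, block_vec):
    # First addition: `inputs + block_vec`, as in figure 2.
    out = add(inputs, block_vec)
    # Second addition happens inside the feed-forward call.
    return feed_forward_with_residual(out)

inputs = [1.0, 2.0]
block_vec = [0.5, -1.0]
result = layer(inputs, block_vec)
print(result)
```

In this sketch, `x` in the second addition is already `inputs + block_vec`, which is why I count two additions in one layer.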