PIE
About the Edit-factorized BERT Architecture
For the replace operation, when we calculate the attention scores for position i, we do not attend to the token w(i).
At the first layer I think this is fine, but at the second and higher layers don't we end up using the information of w(i) indirectly, since the other positions attended to it?
Is that OK?
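To make the concern concrete, here is a minimal NumPy sketch (my own simplification, not the actual PIE code): queries come from positional embeddings, keys/values from token embeddings, and position i is masked out of its own attention. At layer 1 the output at position i is independent of w(i), but at layer 2 it is not, because it attends to other positions that already mixed in w(i).

```python
import numpy as np

def attention(q, k, v, mask):
    """Scaled dot-product attention; mask[i, j] = False blocks token j for query i."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.normal(size=(n, d))        # token embeddings w(1..n)
p = rng.normal(size=(n, d))        # positional embeddings (used as queries here)
mask = ~np.eye(n, dtype=bool)      # each position attends to every token except its own

def two_layers(tok):
    h1 = attention(p, tok, tok, mask)   # layer 1: position i never sees w(i) directly
    h2 = attention(p, h1, h1, mask)     # layer 2: attends to h1[j], which already mix in w(i)
    return h1, h2

h1, h2 = two_layers(x)

# Perturb w(i) and check which outputs at position i change.
i = 2
x_pert = x.copy()
x_pert[i] += 1.0
h1_p, h2_p = two_layers(x_pert)

print("layer-1 output at i depends on w(i):", not np.allclose(h1[i], h1_p[i]))  # False
print("layer-2 output at i depends on w(i):", not np.allclose(h2[i], h2_p[i]))  # True
```

This is just a generic two-layer illustration of the indirect information flow I am asking about, not a claim about how the real implementation handles it.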