context_recommendation
About multi-head
The article says that the K hidden units in the middle of the multi-head AutoEncoder capture local context information. But why does each hidden unit take the entire masked input mask(x)? That seems equivalent to running a traditional DAE K times. Does this make any sense?
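To make the concern concrete, here is a minimal NumPy sketch of how I read the setup. It is not the repository's code: the sizes, the sigmoid activation, and the names `W`, `b`, `heads`, and `wide` are all my own assumptions. It shows that if every head applies its own weights to the same full masked input, the K heads together are numerically the same as one wider DAE encoder layer split into K slices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_items, hidden_dim, K = 8, 4, 3

# A binary interaction vector and a random mask, standing in for mask(x).
x = rng.integers(0, 2, size=n_items).astype(float)
mask = rng.random(n_items) > 0.3
x_masked = x * mask


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Assumed per-head encoder parameters: one weight matrix and bias per head.
W = rng.normal(size=(K, hidden_dim, n_items))
b = rng.normal(size=(K, hidden_dim))

# Multi-head view: every head encodes the *entire* masked input.
heads = np.stack([sigmoid(W[k] @ x_masked + b[k]) for k in range(K)])

# Single-DAE view: stack the K weight matrices into one wide encoder layer.
W_wide = W.reshape(K * hidden_dim, n_items)
b_wide = b.reshape(K * hidden_dim)
wide = sigmoid(W_wide @ x_masked + b_wide).reshape(K, hidden_dim)

# True when the two views produce the same hidden activations.
heads_match_wide = np.allclose(heads, wide)
print(heads_match_wide)
```

If this reading is right, any "local context" in each head would have to come from something outside the encoder input itself (for example how the heads are combined or regularized), which is the part I would like clarified.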