attention-is-all-you-need-keras
I think I found a point that should be changed
self.target_layer = TimeDistributed(Dense(o_tokens.num(), use_bias=False))

should be changed to:

self.target_layer = TimeDistributed(Dense(o_tokens.num(), activation='softmax', use_bias=False))
It's very interesting: when I use softmax as proposed in the paper, the loss does not go down.
The TF loss already applies a softmax internally, so it expects raw logits. With this change, softmax gets applied twice.
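For reference, a minimal sketch of why the double softmax stalls training, assuming the loss is built on tf.nn.sparse_softmax_cross_entropy_with_logits (the exact loss in this repo may differ, but any "with_logits" loss behaves the same way):

import tensorflow as tf

# Toy output for one token over a 3-word vocabulary, and its true label.
logits = tf.constant([[2.0, 0.5, -1.0]])
labels = tf.constant([0])

# Correct: the Dense layer has no activation, so it emits raw logits,
# and the loss applies softmax internally exactly once.
loss_ok = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)
print(loss_ok.numpy())  # ~0.24

# Broken: with activation='softmax' on the Dense layer, the loss receives
# probabilities and squashes them through softmax a second time. The
# resulting distribution is nearly uniform, so gradients are tiny and the
# loss barely moves.
probs = tf.nn.softmax(logits)
loss_double = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=probs)
print(loss_double.numpy())  # ~0.70, and much flatter gradients

So the fix is one or the other: keep the Dense layer linear and use a from_logits-style loss (as the repo does now), or add the softmax activation and switch to a loss that expects probabilities. Doing both is what makes the loss unable to go down.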