Paul Tardy
Actually, the coverage mechanism isn't implemented for Transformer decoders. Coverage comes from See et al. (2017), which is based on RNNs instead (LSTMs, actually), hence a single attention head. It's not clear...
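For reference, the coverage mechanism in that paper keeps a running sum of the past attention distributions and adds a loss term penalizing attention on already-covered source tokens:

```latex
c^t = \sum_{t'=0}^{t-1} a^{t'}, \qquad \text{covloss}_t = \sum_i \min\left(a_i^t, c_i^t\right)
```

Note that `c^t` only sums the attention from steps strictly before `t`.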
Well, at least it gives some guidelines for implementing coverage in the Transformer. Feel free to implement this paper and open a PR; we would review it. Results show some...
@Qnlp Absolutely, and better results as well. The Transformer has many heads, and it has encoder self-attention, decoder self-attention AND cross-attention (instead of a single cross-attention layer in RNNs), so it may generalize the concept... One possible generalization is sketched below.
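Purely as a sketch of what a multi-head generalization could look like (hypothetical, not something implemented in OpenNMT-py): average the cross-attention over heads, then apply the same `min(a^t, c^t)` loss per decoder step:

```python
import torch

def multihead_coverage_loss(cross_attn: torch.Tensor) -> torch.Tensor:
    """Hypothetical coverage loss for a Transformer decoder.

    cross_attn: (batch, heads, tgt_len, src_len) cross-attention weights.
    Averages over heads, then applies the See-style min(a^t, c^t) per step.
    """
    attn = cross_attn.mean(dim=1)              # (batch, tgt_len, src_len)
    cov = torch.cumsum(attn, dim=1) - attn     # c^t excludes a^t itself
    return torch.min(attn, cov).sum(dim=-1).mean()

# dummy usage
attn = torch.softmax(torch.randn(2, 8, 10, 15), dim=-1)
print(multihead_coverage_loss(attn))
```

Whether to average heads, pick one head, or keep a separate coverage vector per head is exactly the kind of design question such a paper would have to settle.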
@mahimanzum sorry, I didn't check GitHub notifications for a while. There's no option to do it directly in OpenNMT-py; it would require a few tweaks. First, the attention weights are...
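In the meantime, one generic way to grab them (not an OpenNMT-py API, just a plain PyTorch forward hook, shown here on `nn.MultiheadAttention`, which returns the weights as its second output):

```python
import torch
import torch.nn as nn

captured = []

def save_attn(module, inputs, output):
    # nn.MultiheadAttention returns (attn_output, attn_weights)
    captured.append(output[1].detach().cpu())

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
handle = mha.register_forward_hook(save_attn)

q = torch.randn(2, 5, 16)   # (batch, tgt_len, dim)
kv = torch.randn(2, 7, 16)  # (batch, src_len, dim)
mha(q, kv, kv, need_weights=True)

handle.remove()
print(captured[0].shape)  # (batch, tgt_len, src_len), averaged over heads
```

You would register the hook on the decoder's cross-attention module and run translation as usual.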
@flauted do you get normal scores without the coverage penalty? The problem is probably not about beta, though.
Could you try with the parameters from the paper? In particular, using `-coverage_penalty summary`.
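For context, `wu` is the GNMT coverage penalty (Wu et al., 2016), which penalizes source tokens that receive less than 1.0 total attention, while `summary` (as I read it; this is a paraphrase, check `onmt/translate/penalties.py` for the exact code) penalizes tokens attended more than 1.0, which fits summarization better:

```python
import torch

def coverage_wu(cov: torch.Tensor, beta: float) -> torch.Tensor:
    # GNMT-style: log-penalty for source tokens with total attention < 1.0
    # note: a token with zero total attention gives log(0) = -inf here,
    # which is one way predictions can end up with infinite scores
    return beta * -torch.clamp(cov, max=1.0).log().sum(-1)

def coverage_summary(cov: torch.Tensor, beta: float) -> torch.Tensor:
    # penalize source tokens whose total attention exceeds 1.0 (repetition)
    return beta * torch.clamp(cov - 1.0, min=0.0).sum(-1)

# cov: attention summed over decoding steps, shape (batch, src_len)
cov = torch.tensor([[0.2, 1.7, 0.9, 1.1]])
print(coverage_wu(cov, beta=1.0))       # penalizes the 0.2 and 0.9 tokens
print(coverage_summary(cov, beta=1.0))  # penalizes the 1.7 and 1.1 tokens
```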
It would be interesting to check the prediction scores, in particular to see how many sentences get `inf` scores.
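Something like this, assuming you dump one score per prediction to a file (hypothetical `scores.txt`):

```python
import math

# hypothetical dump: one log-likelihood score per predicted sentence
with open("scores.txt") as f:
    scores = [float(line) for line in f]

n_inf = sum(1 for s in scores if math.isinf(s))
print(f"{n_inf}/{len(scores)} predictions with infinite score")
```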
It actually seems like a mistake to me: with `a^t` included in the summation, we have `c^t >= a^t` elementwise (attention weights are non-negative), so `min(a^t, c^t) = a^t`, which does not really make sense as a loss.
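Quick numerical check of that claim: when `a^t` is inside the sum, the loss degenerates to `sum(a^t) = 1` at every step, so it carries no signal:

```python
import torch

T, S = 5, 8  # toy sizes: decoding steps, source length
attn = torch.softmax(torch.randn(T, S), dim=-1)  # a^0 .. a^{T-1}

t = 3
cov_excl = attn[:t].sum(0)      # c^t over t' < t (See et al., 2017)
cov_incl = attn[:t + 1].sum(0)  # buggy variant: a^t included in the sum

# a^t included => c^t >= a^t elementwise => the min is always a^t
assert torch.equal(torch.min(attn[t], cov_incl), attn[t])

print(torch.min(attn[t], cov_excl).sum())  # informative coverage loss
print(torch.min(attn[t], cov_incl).sum())  # always 1.0: useless as a loss
```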
Ok, it makes sense. I found some results where the difference was around 9 ROUGE points (on 11.5k sentences), which is not close at all. Maybe I made a mistake...
Sorry for the delay, could you open a PR for that? Thanks