marian_decoder starting and ending logic
I was inspecting intermediate values of the output tensor in transformer.h while running marian_decoder, and noticed that on the first step through the decoder some token is passed whose word embedding is all zeros.
Q1) What token is used as a prefix? Is there some trick that makes its embedding 0?
Q2) How does the decoder know to terminate a translation? In my python port of the opus-nmt models, the decoder never predicts </s>.
Additional Clues
My python port of the opus-nmt models works nicely when English is the source language, and just generates a dummy token when it is done translating. For fr-en, however, it generates nonsense at the beginning of the generation, whereas marian-decoder generates no nonsense at all :)
sample_text = "Donnez moi le micro ."
my_result = ', uh... give me the microphone .'  # after constraining max_length
marian_decoder_result = 'Give me the microphone!'  # after sentencepiece detokenization
Thanks in advance!
Q1: The embedding of the sentence-start (BOS or <s>) context is hard-coded to be 0. It is not copied from the embedding matrix. I always felt that's a bug, but anecdotally, it makes no accuracy difference.
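For intuition, here is a minimal sketch (plain numpy, not Marian's actual code) of what "hard-coded to 0" means at the first decoder step; `embed_decoder_input`, `embedding_matrix`, and `dim_emb` are made-up names for illustration:

```python
import numpy as np

dim_emb = 8
vocab_size = 100
# Hypothetical target-side embedding matrix.
embedding_matrix = np.random.randn(vocab_size, dim_emb)

def embed_decoder_input(prev_token: int, step: int) -> np.ndarray:
    # Step 0: the sentence-start context is a zero vector; nothing is
    # looked up from embedding_matrix, not even the row for <s>.
    if step == 0:
        return np.zeros(dim_emb)
    # Later steps embed the previously generated token as usual.
    return embedding_matrix[prev_token]
```

So a python port can simply feed a zero vector at step 0 instead of looking up a BOS row, which may explain the garbage your port produces at the start of generation.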
Q2: Each beam hypothesis that ends in EOS (or </s>) will cease to be expanded. Once all hyps for a sentence end in EOS, sentence translation is complete.
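Roughly, that stopping rule looks like this in Python (a sketch under assumed data shapes, not Marian's API; `expand_fn` stands in for the model's next-token scorer and `EOS_ID` for the </s> id):

```python
EOS_ID = 0  # placeholder id for </s>

def step_beam(hyps, expand_fn, beam_size):
    """One beam-search step: hyps is a list of (token_ids, score) pairs."""
    finished = [h for h in hyps if h[0][-1] == EOS_ID]
    active = [h for h in hyps if h[0][-1] != EOS_ID]
    # Finished hypotheses are carried over unchanged; only active ones expand.
    candidates = finished[:]
    for tokens, score in active:
        for tok, logp in expand_fn(tokens):  # model's next-token candidates
            candidates.append((tokens + [tok], score + logp))
    # Keep the best `beam_size` hypotheses overall.
    candidates.sort(key=lambda h: h[1], reverse=True)
    return candidates[:beam_size]

def translation_done(hyps):
    # A sentence is fully translated once every surviving hypothesis
    # ends in EOS.
    return all(h[0][-1] == EOS_ID for h in hyps)
```

If your port never predicts </s>, check that the EOS row of the output projection (and its bias, if any) is actually loaded, since otherwise the loop above never terminates except via max_length.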