marian-dev
Quirky --max-length in marian-decoder
When decoding, if a sentence exceeds --max-length, the decoder stops at the sentence before it. It's sort of a gotcha when the decoded file always falls short of lines.
E.g. if there are 3000 lines in the file to be decoded and the 581st line is longer than 100 tokens, marian-decoder only outputs 580 lines.
cat ../wmt-data/en-de/test.en.tokenized.truecased.bped | ~/marian/build/marian-decoder \
-c model.npz.best-ce-mean-words.npz.decoder.yml \
-m model.npz.best-ce-mean-words.npz -v vocab.src.yml vocab.trg.yml \
-b 6 --max-length 100 --mini-batch 10 --maxi-batch 1000 \
-d 0 1 2 3 > test.de.bpe.out
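To double-check which line is the culprit, something like this works (a rough sketch: awk's whitespace-based token count only approximates how Marian counts subword units):
# Output stops short of the input:
wc -l ../wmt-data/en-de/test.en.tokenized.truecased.bped test.de.bpe.out
# First line with more than 100 whitespace-separated tokens;
# decoding stops just before it:
awk 'NF > 100 { print NR ": " NF " tokens"; exit }' ../wmt-data/en-de/test.en.tokenized.truecased.bped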
And the user has to increase --max-length to 200 to get all lines decoded.
cat ../wmt-data/en-de/test.en.tokenized.truecased.bped | ~/marian/build/marian-decoder \
-c model.npz.best-ce-mean-words.npz.decoder.yml \
-m model.npz.best-ce-mean-words.npz -v vocab.src.yml vocab.trg.yml \
-b 6 --max-length 200 --mini-batch 10 --maxi-batch 1000 \
-d 0 1 2 3 > test.de.bpe.out
I think it's still okay, since the default value for --max-length is 1000, so this would almost always work:
cat ../wmt-data/en-de/test.en.tokenized.truecased.bped | ~/marian/build/marian-decoder \
-c model.npz.best-ce-mean-words.npz.decoder.yml \
-m model.npz.best-ce-mean-words.npz -v vocab.src.yml vocab.trg.yml \
-b 6 --mini-batch 10 --maxi-batch 1000 \
-d 0 1 2 3 > test.de.bpe.out
but it was sort of a gotcha moment when the decoded file kept falling short of lines =)
Maybe some sort of message should be printed if lines get cut off partway through?
Add --max-length-crop; this will cut the line at the max length instead of omitting it.
Maybe I should check whether --max-length-crop vs. --max-length 100000 makes a big difference (see the sketch below).
Regardless, I guess omitting the sentence and returning \n is okay, but stopping the decoding at that line is sort of quirky, especially when the number of lines in the output file differs from the input.
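A hedged sketch of the cropping invocation, assuming --max-length-crop combines with --max-length as described above (over-length input gets truncated to 100 tokens rather than dropped):
cat ../wmt-data/en-de/test.en.tokenized.truecased.bped | ~/marian/build/marian-decoder \
-c model.npz.best-ce-mean-words.npz.decoder.yml \
-m model.npz.best-ce-mean-words.npz -v vocab.src.yml vocab.trg.yml \
-b 6 --max-length 100 --max-length-crop --mini-batch 10 --maxi-batch 1000 \
-d 0 1 2 3 > test.de.bpe.out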
Yeah, the reason for this is that the decoder uses the same input pipeline as training, and removing overly long lines makes sense there. I can see your point, though.
Maybe we should enable cropping by default, but still output a warning during decoding?
@snukky Opinions?
An alternative to enabling --max-length-crop by default might be generating an empty translation if the decoder sees an input sentence exceeding --max-length; I thought it already did that.
I'm not sure what is better but in either case marian-decoder should produce a clear warning.
Why is this needed for offline decoding anyway? How about just keeping the option in training, but not automatically propagating it to the decoder, and defaulting to SIZE_MAX in decoding (and fail if out of memory)?
BTW, what's the value of SIZE_MAX?
2^64-1
This is a limit on the source. You guys are forgetting that the encoder is computed in a single step. Since GPU RAM is not infinite, it's not SIZE_MAX.
"(and fail if out of memory)"
The use case is, for instance, back-translation. It's a bit annoying to have that fail if you have stray long inputs around sentence 2 million.
The dominant use case is translating things. Where you want one line in and one line out. You also want that for backtranslation.
https://github.com/kpu/preprocess/blob/master/preprocess/remove_long_lines_main.cc
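For illustration, a rough stand-in for that tool (hedged: it drops lines with more than 100 whitespace-separated tokens, and the file names are placeholders; parallel data would need both sides filtered together to stay aligned):
awk 'NF <= 100' corpus.en > corpus.filtered.en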
Can you condense your preference into a paragraph?
Marian should preserve the principle of one line in, one line out. If you're backtranslating web data, the long line should have been removed from the input; there's nothing consistent Marian can do with it. If you want alternative behavior like a blank line (which could confuse downstream processing), that can be an option. My default would be to die if it's over memory, but translate if it's [max-length,memory).
So your default would be do nothing for translation, same as Frank proposed. Fair enough.
We can leave --max-length and --max-length-crop in there anyway as optional settings.
Yes, my thumbs up on Frank's comment stands.
OK. @snukky can you take care of this according to the will of the vocal majority?
Should we warn even if there is no enforced length limit?
I would go by what the dominant use case is. If that is translation, there should be a warning if a limit is set (!), as unnoticed truncation would affect your BLEU score. If it is back-translation, there should be a warning if not set. Is there any heuristic for discovering whether an invocation is back-translation?
"Marian should preserve the principle of one line in, one line out." Absolutely, this should be non-negotiable. The entire data-file organization rests on the assumption that parallel data files are aligned line by line ("struct of arrays"). In our statistical world, off-by-one alignment errors towards the end of back-translated training data may cause subtle degradations that may go unnoticed (I've seen this once in my work on speech recognition).
"alternative behavior like a blank line" This would require Marian to special-case blank lines when reading back the back-translated input in the subsequent training. Are blank lines valid input under any circumstance? If it is, then one would have to pre-filter for blank lines, which is not much different than pre-filtering for too long input (both require an additional processing step, and to manage an additional version of the data). Instead of a blank line, we could consider writing out a special tag that is understood by Marian's reader (but other tools would not understand it).
I vote for no special handling of blank lines.
I think blank lines are currently also not handled correctly during translation. They are ignored during training and probably also during translation. Another thing to be fixed. I agree they should just be passed through.
As for use cases, back-translation is just translation; there's no way to identify it.
Do we know the corpus size in advance? Then we could say, e.g., translating more than 10,000 sentences is likely back-translation, and then enable a warning.
Not when reading from STDIN.
I still contend that there's nothing different in these cases. If you're feeding long sentences into backtranslation, it's your fault for not cleaning it up before. Because otherwise you'll have to clean it up after which is messier.
Yes, I can see this as the default behavior. Just asking if we should warn somehow?
I would support warning once per process.
I am leaning against. Isn't setting a max length for back-translation something you forget once and never again?
@frankseide I'm worried about a user who remembers to set a max length, passes --max-length, and is confused why it doesn't do anything.
But also not that strongly opinionated on whether a warning should exist.
In my mind, if one says --max-length, it should do something; but if it is not set, it should default to SIZE_MAX, i.e., do nothing. And then Marcin will forget it once in back-translation, waste a few hours, and never forget it again...
I will also pass on all angry issues to you :)