marian-dev
Quirky --max-length in marian-decoder
When decoding, if a sentence exceeds --max-length, the decoder stops at the sentence before it. It's sort of a gotcha when the decoded file always falls short of lines.
E.g. if there are 3000 lines in the file to be decoded and the 581st line is longer than 100 tokens, marian-decoder only outputs 580 lines.
cat ../wmt-data/en-de/test.en.tokenized.truecased.bped | ~/marian/build/marian-decoder \
-c model.npz.best-ce-mean-words.npz.decoder.yml \
-m model.npz.best-ce-mean-words.npz -v vocab.src.yml vocab.trg.yml \
-b 6 --max-length 100 --mini-batch 10 --maxi-batch 1000 \
-d 0 1 2 3 > test.de.bpe.out
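To double-check which line is the culprit, something like this works (a rough sketch: awk's whitespace-based token count only approximates how Marian counts subword units):
# Output stops short of the input:
wc -l ../wmt-data/en-de/test.en.tokenized.truecased.bped test.de.bpe.out
# First line with more than 100 whitespace-separated tokens;
# decoding stops just before it:
awk 'NF > 100 { print NR ": " NF " tokens"; exit }' ../wmt-data/en-de/test.en.tokenized.truecased.bped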
And the user has to increase --max-length to 200 to get all lines decoded.
cat ../wmt-data/en-de/test.en.tokenized.truecased.bped | ~/marian/build/marian-decoder \
-c model.npz.best-ce-mean-words.npz.decoder.yml \
-m model.npz.best-ce-mean-words.npz -v vocab.src.yml vocab.trg.yml \
-b 6 --max-length 200 --mini-batch 10 --maxi-batch 1000 \
-d 0 1 2 3 > test.de.bpe.out
I think it's still okay, since the default value for --max-length is 1000, so this would almost always work:
cat ../wmt-data/en-de/test.en.tokenized.truecased.bped | ~/marian/build/marian-decoder \
-c model.npz.best-ce-mean-words.npz.decoder.yml \
-m model.npz.best-ce-mean-words.npz -v vocab.src.yml vocab.trg.yml \
-b 6 --mini-batch 10 --maxi-batch 1000 \
-d 0 1 2 3 > test.de.bpe.out
but it was sort of a gotcha moment when the decoded file kept falling short of lines =)
Maybe some sort of message should be printed if lines get cut off partway through?
Add --max-length-crop; this will cut the line at the max length instead of omitting it.
Maybe I should check whether --max-length-crop vs. --max-length 100000 makes a big difference (see the sketch below).
Regardless, I guess omitting the sentence and returning \n is okay, but stopping the decoding at that line is sort of quirky, especially when the number of lines in the output file differs from the input.
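A hedged sketch of the cropping invocation, assuming --max-length-crop combines with --max-length as described above (over-length input gets truncated to 100 tokens rather than dropped):
cat ../wmt-data/en-de/test.en.tokenized.truecased.bped | ~/marian/build/marian-decoder \
-c model.npz.best-ce-mean-words.npz.decoder.yml \
-m model.npz.best-ce-mean-words.npz -v vocab.src.yml vocab.trg.yml \
-b 6 --max-length 100 --max-length-crop --mini-batch 10 --maxi-batch 1000 \
-d 0 1 2 3 > test.de.bpe.out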
Yeah, the reason for this is that the decoder uses the same input pipeline as training, and removing overly long lines makes sense there. I can see your point, though.
Maybe we should enable cropping by default, but still output a warning during decoding?
@snukky Opinions?
An alternative to enabling --max-length-crop by default might be generating an empty translation if the decoder sees an input sentence exceeding --max-length; I thought it already did that.
I'm not sure what is better but in either case marian-decoder should produce a clear warning.
Why is this needed for offline decoding anyway? How about just keeping the option in training, but not automatically propagating it to the decoder, and defaulting to SIZE_MAX in decoding (and fail if out of memory)?
BTW, what's the value of SIZE_MAX?
2^64-1
This is a limit on the source. You guys are forgetting that the encoder is computed in a single step. Since GPU RAM is not infinite, it's not SIZE_MAX.
"(and fail if out of memory)"
The use case is, for instance, back-translation. It's a bit annoying to have that fail if you have stray long inputs around sentence 2 million.
The dominant use case is translating things. Where you want one line in and one line out. You also want that for backtranslation.
https://github.com/kpu/preprocess/blob/master/preprocess/remove_long_lines_main.cc
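For illustration, a rough stand-in for that tool (hedged: it drops lines with more than 100 whitespace-separated tokens, and the file names are placeholders; parallel data would need both sides filtered together to stay aligned):
awk 'NF <= 100' corpus.en > corpus.filtered.en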
Can you condense your preference into a paragraph?
Marian should preserve the principle of one line in, one line out. If you're backtranslating web data, the long line should have been removed from the input; there's nothing consistent Marian can do with it. If you want alternative behavior like a blank line (which could confuse downstream processing), that can be an option. My default would be to die if it's over memory, but translate if it's [max-length,memory).
So your default would be do nothing for translation, same as Frank proposed. Fair enough.
We can leave --max-length and --max-length-crop in there anyway as optional settings.
Yes, my thumbs up on Frank's comment stands.
OK. @snukky can you take care of this according to the will of the vocal majority?
Should we warn even if there is no enforced length limit?
I would go by what the dominant use case is. If that is translation, there should be a warning if a limit is set (!), as unnoticed truncation would affect your BLEU score. If it is back-translation, there should be a warning if not set. Is there any heuristic for discovering whether an invocation is back-translation?
"Marian should preserve the principle of one line in, one line out." Absolutely, this should be non-negotiable. The entire data-file organization rests on the assumption that parallel data files are aligned line by line ("struct of arrays"). In our statistical world, off-by-one alignment errors towards the end of back-translated training data may cause subtle degradations that may go unnoticed (I've seen this once in my work on speech recognition).
"alternative behavior like a blank line" This would require Marian to special-case blank lines when reading back the back-translated input in the subsequent training. Are blank lines valid input under any circumstance? If it is, then one would have to pre-filter for blank lines, which is not much different than pre-filtering for too long input (both require an additional processing step, and to manage an additional version of the data). Instead of a blank line, we could consider writing out a special tag that is understood by Marian's reader (but other tools would not understand it).
I vote for no special handling of blank lines.
I think blank lines are currently also not handled correctly during translation. They are ignored during training and probably also during translation. Another thing to be fixed. I agree they should just be passed through.
As for use cases, back-translation is just translation; there's no way to identify it.
Do we know the corpus size in advance? Then we could say, e.g., translating more than 10,000 sentences is likely back-translation, and then enable a warning.
Not when reading from STDIN.
I still contend that there's nothing different in these cases. If you're feeding long sentences into backtranslation, it's your fault for not cleaning it up before. Because otherwise you'll have to clean it up after which is messier.
Yes, I can see this as the default behavior. Just asking if we should warn somehow?
I would support warning once per process.
I am leaning against. Isn't setting a max length for back-translation something you forget once and never again?
@frankseide I'm worried about a user who remembers to set a max length, passes --max-length, and is confused why it doesn't do anything.
But also not that strongly opinionated on whether a warning should exist.
In my mind, if one says --max-length, it should do something; but if it is not set, it should default to SIZE_MAX, i.e., do nothing. And then Marcin will forget it once in back-translation, waste a few hours, and never forget it again...
I will also pass on all angry issues to you :)