joeynmt icon indicating copy to clipboard operation
joeynmt copied to clipboard

Empty line handling

Open juliakreutzer opened this issue 3 years ago • 5 comments

In translate mode, when a file with empty lines is provided, JoeyNMT's error message is not very helpful:

File "/usr/local/lib/python3.7/dist-packages/joeynmt/tokenizers.py", line 81, in pre_process
    assert raw_input is not None and len(raw_input) > 0, raw_input
AssertionError

Perhaps one could simply skip the line, or output a warning or more informative error message. I'm not sure if the other modes are ready to handle empty lines, haven't tested it yet.

Here an example: image

juliakreutzer avatar Jun 14 '22 01:06 juliakreutzer

  • background: An empty line raises an error in sacrebleu. maybe need to skip empty lines before evaluation??

  • If we remove empty lines internally, input line numbers and output line numbers in test will be different. For instance, sentence in line 100 of src file will not be aligned to the sentence in line 100 in trg.

    cf. ) in plaintext dataset, an empty line will be skiped in data loading. https://github.com/joeynmt/joeynmt/blob/6c580f838ad19ea22222a8ea14ca0a371533aef4/joeynmt/datasets.py#L154-L159

    Basically, we should do this in parallel both for src and trg if trg is given. Otherwise, the number of sequences becomes different between src and trg if only src/trg contains an empty line. (In file stream input, we only have src and no trg, so it doesn't matter, maybe...)

  • need this empty line handling for all dataset types, including file streams.

may- avatar Jun 14 '22 05:06 may-

Yes, very good points. My concern was mostly about the translate mode, where we don't have targets, and also no sacrebleu computation. We don't want to filter, but we want the user to know that there's an empty line problem. What do you think about just raising an assertion with a more informative error message?

juliakreutzer avatar Aug 31 '22 13:08 juliakreutzer

@juliakreutzer

What do you think about just raising an assertion with a more informative error message?

yes, that sounds reasonable. I'll write an error message, then.
Currently, the error occurs in tokenization after the training/prediction loop has started, but we can raise an error in data loading, before the minibatches are constructed.

may- avatar Aug 31 '22 15:08 may-

Note: the same assertion error can happen, when the model generate an empty string (i.e. special symbol only, such as <unk> + </s>). Need to handle this not only in the data loading but also in prediction. (alternatively, set generate unk: False and min_output_length > 2 in testing config.)

may- avatar Sep 07 '22 05:09 may-

more informative error message in v2.1.0: https://github.com/joeynmt/joeynmt/blob/32eef89e14355a9e21fb2671ed401ff3108f2c84/joeynmt/tokenizers.py#L64-L73

FYI @juliakreutzer

may- avatar Sep 19 '22 08:09 may-