Empty line handling
In translate mode, when a file with empty lines is provided, JoeyNMT's error message is not very helpful:
File "/usr/local/lib/python3.7/dist-packages/joeynmt/tokenizers.py", line 81, in pre_process
assert raw_input is not None and len(raw_input) > 0, raw_input
AssertionError
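For illustration, the bare assertion could be replaced with a more descriptive check along these lines (a sketch only; `pre_process` here is a simplified stand-in for JoeyNMT's tokenizer method, and the `line_no` parameter is hypothetical):

```python
def pre_process(raw_input, line_no=None):
    """Simplified stand-in for the tokenizer's pre_process step.

    Raises a descriptive ValueError instead of a bare AssertionError
    when the input line is empty or missing.
    """
    if raw_input is None or len(raw_input) == 0:
        where = f" (line {line_no})" if line_no is not None else ""
        raise ValueError(
            f"Empty input{where}. Please remove empty lines from the "
            "input file before running translate mode.")
    return raw_input.strip()
```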
Perhaps one could simply skip the line, or output a warning or a more informative error message. I'm not sure whether the other modes can handle empty lines; I haven't tested that yet.
Here is an example:

-
background: An empty line raises an error in sacrebleu. Maybe we need to skip empty lines before evaluation?
-
If we remove empty lines internally, input and output line numbers in test mode will differ. For instance, the sentence in line 100 of the src file will no longer be aligned with the sentence in line 100 of the trg file.
cf.: in the plaintext dataset, an empty line will be skipped in data loading. https://github.com/joeynmt/joeynmt/blob/6c580f838ad19ea22222a8ea14ca0a371533aef4/joeynmt/datasets.py#L154-L159
Basically, we should do this in parallel for both src and trg if trg is given. Otherwise, the number of sequences differs between src and trg when only one side contains an empty line. (In file-stream input we only have src and no trg, so there it may not matter, maybe...)
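A sketch of what such parallel filtering could look like (a hypothetical helper, not JoeyNMT's actual code): a pair is dropped whenever either side is empty, so src and trg stay aligned with each other.

```python
def filter_empty_pairs(src_lines, trg_lines=None):
    """Drop sentence pairs where either side is empty.

    If trg_lines is None (e.g. file-stream input in translate mode),
    only src is filtered. Returns (src, trg), with trg possibly None.
    """
    if trg_lines is None:
        return [s for s in src_lines if s.strip()], None
    # Keep a pair only if BOTH sides are non-empty, so the two sides
    # remain aligned line-by-line after filtering.
    pairs = [(s, t) for s, t in zip(src_lines, trg_lines)
             if s.strip() and t.strip()]
    src, trg = zip(*pairs) if pairs else ((), ())
    return list(src), list(trg)
```

Note that this preserves src-trg alignment but, as discussed above, it breaks alignment against the original input file's line numbers.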
-
We need this empty-line handling for all dataset types, including file streams.
Yes, very good points. My concern was mostly about translate mode, where we have no targets and hence no sacrebleu computation. We don't want to filter there, but we want the user to know that there's an empty-line problem. What do you think about just raising an assertion with a more informative error message?
@juliakreutzer
What do you think about just raising an assertion with a more informative error message?
Yes, that sounds reasonable. I'll write an error message, then.
Currently, the error occurs in tokenization after the training/prediction loop has started, but we can raise an error in data loading, before the minibatches are constructed.
Note: the same assertion error can happen when the model generates an empty string (i.e. special symbols only, such as <unk> + </s>). We need to handle this not only in data loading but also in prediction.
(Alternatively, set generate_unk: False and min_output_length > 2 in the testing config.)
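On the prediction side, one possible guard (a sketch with a hypothetical helper and placeholder token, not JoeyNMT's actual code) is to substitute a placeholder for hypotheses that come out empty after special symbols are stripped, so downstream scoring does not crash on empty strings:

```python
def sanitize_hypotheses(hyps, placeholder="<empty>"):
    """Replace empty model outputs (e.g. the model generated only
    special tokens, which were then stripped in post-processing)
    so that sacrebleu scoring does not receive empty strings."""
    return [h if h.strip() else placeholder for h in hyps]
```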
A more informative error message was added in v2.1.0: https://github.com/joeynmt/joeynmt/blob/32eef89e14355a9e21fb2671ed401ff3108f2c84/joeynmt/tokenizers.py#L64-L73
FYI @juliakreutzer