firefox-translations-training
Improve implementation of alignments
Issues with the current implementation:
- We use naive tokenization because that is what OpusTrainer requires. This may produce lower-quality alignments because punctuation is not taken into account, and the eflomal vocabulary becomes very large. Ideally, we should switch to Moses tokenization, which also separates punctuation.
- Processing Moses-tokenized text is likely also faster and more efficient because of the smaller vocabulary.
- Because we don't call str.split() explicitly, some double spaces may remain in the text and lead to discrepancies between the tokenization used for alignments and what OpusTrainer does.
- We see some warnings during training:
  [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs')
  They are likely related to the whitespace differences, but this requires further investigation. There are not many of them.
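
To make the suspected whitespace problem concrete, here is a minimal sketch of how a double space could produce an out-of-bound alignment pair. It is an illustration under stated assumptions, not the pipeline's or OpusTrainer's actual code: it assumes alignments are in the Pharaoh `src-trg` index format, that one side counts tokens by splitting on a single space while the other splits on runs of whitespace, and the helper names `parse_alignment` and `out_of_bound_pairs` are hypothetical.

```python
def parse_alignment(line):
    """Parse Pharaoh-style pairs such as "0-0 2-1" into (src, trg) index tuples."""
    pairs = []
    for pair in line.split():
        src, trg = pair.split("-")
        pairs.append((int(src), int(trg)))
    return pairs


def out_of_bound_pairs(pairs, n_src, n_trg):
    """Return the pairs whose indices do not fit the given token counts."""
    return [(s, t) for s, t in pairs if not (0 <= s < n_src and 0 <= t < n_trg)]


# Hypothetical sentence pair; note the double space in the source.
source = "Hello  world !"
target = "Hallo Welt !"

# Assumption: the aligner saw tokens split on a single space, so the empty
# string between the two spaces counted as a token and shifted the indices.
tokens_single_space = source.split(" ")  # ['Hello', '', 'world', '!']
tokens_any_whitespace = source.split()   # ['Hello', 'world', '!']

alignment = parse_alignment("0-0 2-1 3-2")

# Consistent with the 4-token view, but index 3 is invalid for 3 tokens.
ok = out_of_bound_pairs(alignment, len(tokens_single_space), len(target.split()))
bad = out_of_bound_pairs(alignment, len(tokens_any_whitespace), len(target.split()))

# One possible mitigation: collapse whitespace before aligning.
normalized = " ".join(source.split())  # 'Hello world !'
```

If this is indeed the cause, normalizing whitespace with `" ".join(text.split())` before running the aligner would make both sides agree on token counts. Whether it explains all of the warnings still needs the investigation mentioned above.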