firefox-translations-training Improve implementation of alignments

Open eu9ene opened this issue 10 months ago • 0 comments

Issues with the current implementation:

We use naive (whitespace) tokenization because that is what OpusTrainer requires. This may lower alignment quality because punctuation stays attached to the words, and it also makes the eflomal vocabulary very large. Ideally, we should find a way to switch to Moses tokenization, which also separates punctuation.
Processing Moses-tokenized text is likely also faster and more efficient because of the smaller vocabulary.
Because we don't call str.split() explicitly, the text may contain double spaces, which can cause discrepancies between the tokenization used for the alignments and the one OpusTrainer applies.
We see some warnings during training: [Trainer] [WARNING] Skipping line because of exception: ValueError('Out-of-bound alignment pairs'). They are likely related to the whitespace differences, but this requires further investigation. There are not many of them.
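
A minimal sketch of how a double space could produce an out-of-bound pair, assuming alignments are stored as `src-trg` token index pairs. The helper names (`parse_alignment`, `out_of_bound_pairs`), the sentences, and the alignment string are made up for illustration; this is not the pipeline's or OpusTrainer's actual code, and it is not confirmed that this is the cause of the warnings.

```python
def parse_alignment(line: str) -> list[tuple[int, int]]:
    """Parse index pairs such as "0-0 2-1" into (src, trg) tuples."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]


def out_of_bound_pairs(src_tokens, trg_tokens, pairs):
    """Return the pairs whose indices point past the end of either sentence."""
    return [
        (s, t)
        for s, t in pairs
        if s >= len(src_tokens) or t >= len(trg_tokens)
    ]


src = "Hello  world !"  # note the double space
trg = "Hallo Welt !"

# Splitting on a single space keeps an empty token for the double space,
# while str.split() with no argument collapses any run of whitespace.
tokens_single_space = src.split(" ")  # ['Hello', '', 'world', '!']
tokens_any_whitespace = src.split()   # ['Hello', 'world', '!']

# Hypothetical alignment whose source indices count the empty token.
alignment = parse_alignment("0-0 2-1 3-2")

# Consistent with the 4-token view, but index 3 does not exist in the 3-token view.
ok = out_of_bound_pairs(tokens_single_space, trg.split(), alignment)
bad = out_of_bound_pairs(tokens_any_whitespace, trg.split(), alignment)
```

Here `ok` is empty and `bad` contains `(3, 2)`. If this is indeed the cause, normalizing both sides with `" ".join(text.split())` before aligning would make the two token counts agree.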

Apr 01 '24 21:04 eu9ene