firefox-translations-training

Remove max_words filtering from data importers

Open eu9ene opened this issue 4 months ago • 4 comments

Any filtering should happen only in the cleaning stage (eventually in OpusCleaner). The max_words filtering in the importers was originally copy-pasted from a Bergamot bash script and was never needed. Even if some longer sentences get through and are cleaned later, we can always compensate by adjusting max sentences in the config. We definitely don't want to deal with tokenization at this stage.

This change is required for CJK support.
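
To illustrate the tokenization concern, here is a minimal sketch, assuming the importer-side filter counted whitespace-separated tokens (the actual script may differ). `passes_max_words` is a hypothetical helper written for this example, not code from the repository.

```python
def passes_max_words(line: str, max_words: int) -> bool:
    """Hypothetical importer-side filter (an assumption, not the repo's code):
    keep a line if its whitespace-separated token count is within the limit."""
    return len(line.split()) <= max_words


english = "This is a short English sentence with nine words."
# A Chinese sentence written without spaces, as is normal for the language.
chinese = "这是一个没有空格的中文句子，它的长度与上面的英文句子相当。"

# Whitespace splitting gives a sensible count for English...
print(len(english.split()))
# ...but treats the whole Chinese sentence as a single "word".
print(len(chinese.split()))

# So an arbitrarily long unsegmented line still passes a word limit of 1.
print(passes_max_words(chinese * 100, max_words=1))
```

With a whitespace-based count, the limit has no effect on unsegmented text, and a correct count would need a language-specific tokenizer, which is exactly what the import stage should avoid.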

closes #424

eu9ene, Oct 23 '24 22:10