firefox-translations-training
Remove max_words filtering from data importers
Any filtering should happen only in the cleaning stage (eventually in OpusCleaner). The max_words filtering in the importers was originally copy-pasted from a Bergamot bash script and was never needed. Even if some longer sentences get through and are removed later during cleaning, we can compensate by adjusting max sentences in the config. We definitely don't want to deal with tokenization at this stage.
This change is also required for CJK, where text is not split into words by whitespace.
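To illustrate the CJK point, here is a minimal sketch of why a whitespace-based word limit is unreliable without tokenization. The helper `whitespace_word_count` and the sample sentences are hypothetical and are not taken from the importers; the actual removed filter may have counted words differently.

```python
def whitespace_word_count(sentence: str) -> int:
    """Hypothetical helper: count words by splitting on whitespace."""
    return len(sentence.split())


# An English sentence and a Chinese sentence of comparable length in meaning.
english = "Any filtering should happen only in the cleaning stage."
chinese = "任何过滤都应该只在清理阶段进行。"

en_count = whitespace_word_count(english)
zh_count = whitespace_word_count(chinese)

# The Chinese sentence contains no whitespace, so it counts as a single
# "word" no matter how long it is, and a max_words limit never triggers.
print(en_count, zh_count)
```

Under this assumption, a word-count limit at import time would need a language-aware tokenizer to be meaningful for CJK, which is the kind of work better left to the cleaning stage.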
closes #424