bergamot-translator

Sentence splitter non-breaking prefixes file

Open kpu opened this issue 4 years ago • 2 comments

The non-breaking prefixes file for the sentence splitter depends on the source language. We should bind it to the model somehow (e.g. by knowing which language the model translates from). Otherwise the splitter can produce sentence boundaries that differ from the ones the model saw in training, and that mismatch will confuse the model.
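To illustrate the mismatch, here is a minimal sketch of a toy splitter. It is not the splitter this project uses. The function name `split_sentences` and both prefix sets are made up for the example. It only shows that the same text splits differently depending on which prefix list is loaded.

```python
import re

def split_sentences(text, nonbreaking_prefixes):
    """Toy splitter (hypothetical, for illustration only).

    Breaks after '.', '!' or '?' when followed by whitespace and an
    uppercase ASCII letter, unless the word before a '.' is listed as
    a non-breaking prefix.
    """
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        words = text[start:match.start()].split()
        last_word = words[-1] if words else ""
        # A period after a known prefix (e.g. "Dr") is not a boundary.
        if text[match.start()] == "." and last_word in nonbreaking_prefixes:
            continue
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = "Dr. Smith arrived. He was late."

# Illustrative prefix sets, not taken from any real prefix file.
en_prefixes = {"Dr", "Mr", "Mrs"}
wrong_prefixes = {"bzw", "ca", "Nr"}  # stands in for another language's list

matched = split_sentences(text, en_prefixes)
mismatched = split_sentences(text, wrong_prefixes)
print(matched)     # ['Dr. Smith arrived.', 'He was late.']
print(mismatched)  # ['Dr.', 'Smith arrived.', 'He was late.']
```

With the wrong list, the model would receive "Dr." as a standalone sentence, which it is unlikely to have seen as a complete input during training.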

I'm beginning to think we should have a unified binary file like @XapaJIaMnu was suggesting.

kpu, Apr 19 '21 13:04

This is a problem. We're not consistent between training and test. We're also giving Mozilla the impression that this file doesn't exist when it needs to, which will bite us later.

kpu, May 03 '21 13:05

The cleanest solution is probably to ship the file with the MT models. Or (and this is crazy) stuff it in the YAML somehow.
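Both options could look roughly like the sketch below. It assumes the model configuration has already been parsed into a dict. The keys `nonbreaking-prefix-file` and `nonbreaking-prefixes`, the function `load_prefixes`, and the file name `prefixes.en` are hypothetical names invented for this example, not existing options. It also assumes a one-prefix-per-line file where lines starting with `#` are comments.

```python
import tempfile
from pathlib import Path

def load_prefixes(config, model_dir):
    """Return the non-breaking prefix set bound to a model (sketch).

    `config` is the parsed model configuration; both keys used here
    are hypothetical names chosen for this example.
    """
    # Option B: prefixes embedded directly in the config.
    inline = config.get("nonbreaking-prefixes")
    if inline is not None:
        return set(inline)
    # Option A: a file shipped in the model directory, referenced by
    # a path relative to that directory.
    path = Path(model_dir) / config["nonbreaking-prefix-file"]
    prefixes = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):  # skip blanks and comments
            prefixes.add(line)
    return prefixes

# Usage: a throwaway directory stands in for a shipped model bundle.
with tempfile.TemporaryDirectory() as model_dir:
    Path(model_dir, "prefixes.en").write_text("# English\nDr\nMr\n", encoding="utf-8")
    from_file = load_prefixes({"nonbreaking-prefix-file": "prefixes.en"}, model_dir)
    from_config = load_prefixes({"nonbreaking-prefixes": ["Dr", "Mr"]}, model_dir)
```

Either way the prefix list travels with the model, so the splitter cannot silently pick up a list for the wrong source language.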

kpu, May 03 '21 13:05