sacremoses
"p.m." is not tokenized as in the original script.
I have not yet figured out why, but in the original script the final dot of "p.m."
at the end of a sentence is not split off, while in this port it is.
The original script even explicitly leaves p.m out of
its nonbreaking prefixes, so I'd expect the behavior seen in the port.
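To make the difference concrete, here is a toy sketch of the two behaviors. It is not the code of either tokenizer: the function `split_final_period` and the flag `keep_dotted_abbrev` are made up for illustration, and the idea that a token with an internal dot is what gets kept intact is only my assumption about what might be going on.

```python
import re


def split_final_period(tokens, keep_dotted_abbrev):
    """Toy rule for the trailing period of the last token only.

    keep_dotted_abbrev=True  -> a last token like "p.m." stays intact
    keep_dotted_abbrev=False -> the final period is always split off
    """
    if not tokens:
        return []
    *head, last = tokens
    match = re.fullmatch(r"(\S+)\.", last)
    if match is None:
        # Last token does not end in a period: nothing to do.
        return list(tokens)
    stem = match.group(1)
    # Hypothetical test: the stem has an internal dot and at least one letter.
    dotted = "." in stem and any(ch.isalpha() for ch in stem)
    if keep_dotted_abbrev and dotted:
        return list(tokens)
    return head + [stem, "."]


words = "We met at 5 p.m.".split()
print(split_final_period(words, keep_dotted_abbrev=True))
print(split_final_period(words, keep_dotted_abbrev=False))
```

With the flag on, the last token stays "p.m."; with it off, it comes out as "p.m" followed by ".", which is the behavior described above for the port.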
The original script added a new hack quite recently that changed this: https://github.com/moses-smt/mosesdecoder/pull/204
This difference isn't accounted for in sacremoses, and I'm really not sure whether it should be.
Why shouldn't sacremoses include this?