"p.m." is not tokenized as in the original script.

Open pypae opened this issue 6 years ago • 2 comments

I have not yet figured out why, but the original script does not split off the dot in "p.m." at the end of a sentence, while this port does.

The original script even explicitly leaves p.m out of its nonbreaking prefixes, so I'd expect it to behave the way the port does.
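
To make that expectation concrete, here is a minimal sketch of how a nonbreaking-prefix list usually decides whether a trailing period is split off. It is a simplified, hypothetical model, not the actual Moses or sacremoses code: the names `NONBREAKING_PREFIXES`, `split_trailing_period` and `tokenize` and the toy prefix list are assumptions for illustration only.

```python
# Hypothetical, simplified model of nonbreaking-prefix handling.
# This is NOT the Moses or sacremoses implementation.

# Assumed toy prefix list; "p.m" is deliberately absent, as described above.
NONBREAKING_PREFIXES = {"Mr", "Dr", "e.g"}


def split_trailing_period(token, prefixes=NONBREAKING_PREFIXES):
    """Detach a trailing period unless the stem is a listed prefix."""
    if token.endswith(".") and len(token) > 1:
        stem = token[:-1]
        if stem not in prefixes:
            return [stem, "."]
    return [token]


def tokenize(sentence):
    """Split on whitespace, then handle each word's trailing period."""
    tokens = []
    for word in sentence.split():
        tokens.extend(split_trailing_period(word))
    return tokens


print(tokenize("Dr. Smith arrives at 5 p.m."))
# ['Dr.', 'Smith', 'arrives', 'at', '5', 'p.m', '.']
```

Under this model, "p.m." keeps its dot only if "p.m" is in the prefix list, which is why leaving it out would suggest the split seen in the port.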

pypae, Jan 22 '19 18:01

The original script changed quite recently, when this new hack was added: https://github.com/moses-smt/mosesdecoder/pull/204

This difference isn't accounted for in sacremoses, and I'm really not sure whether it should be.

alvations, Jan 25 '19 07:01

Why shouldn't sacremoses include this?

ZJaume, Jun 04 '20 13:06