sacremoses
Add tokenization for Tetun Dili (tdt)
This PR copies from and replaces #114, which seems to have been stale for more than 2 years, and also updates the nonbreaking prefixes for Tetun Dili.
Hi, thanks for this addition!
Do you have some example sentences that trigger the added regular expressions and (ideally) some of the non-breaking prefixes unique to this language? In the future, I'd like to add tests for all supported languages so we can make sure we don't break/change anything by accident.
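To make the request concrete, here is a rough sketch of the kind of test this would enable. It is a simplified stand-in, not the sacremoses implementation: `split_final_periods` is a hypothetical helper that only models the rule "keep a trailing period attached when the word is a non-breaking prefix", and the prefixes and sentence are placeholders, not taken from the Tetun Dili list.

```python
import re

# Placeholder prefixes for illustration only; NOT taken from the Tetun Dili list.
NONBREAKING_PREFIXES = {"Sr", "Dr"}


def split_final_periods(text, prefixes=NONBREAKING_PREFIXES):
    """Split a trailing period off each whitespace-separated word,
    unless the part before the period is a known non-breaking prefix."""
    tokens = []
    for word in text.split():
        match = re.fullmatch(r"(.+)\.", word)
        if match is None:
            # No trailing period (or the word is a bare "."): keep as-is.
            tokens.append(word)
        elif match.group(1) in prefixes:
            # Non-breaking prefix: the period stays attached.
            tokens.append(word)
        else:
            # Ordinary word: the period becomes its own token.
            tokens.extend([match.group(1), "."])
    return tokens


# Placeholder sentence; a real test would use a Tetun Dili sentence
# and the prefixes actually added by this PR.
print(split_final_periods("Dr. Silva arrived."))
```

A real test would call the actual tokenizer with the new language code and compare against output agreed on by someone who knows the language.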
I don't speak Tetun Dili, so I hope that these tests work as expected...
I noticed there's a test sentence in the original mosesdecoder pull request, but when I try it, the Perl and Python implementations yield different outputs. The implementation in the original pull request (and what's currently in the Moses tokenizer) is also different.
I'll dig a bit deeper to see whether I can find out why #114 decided to implement it differently. I'm tempted to stick to what's in the old Moses repo unless there's a very good reason not to.
Hi, any updates on this? Shall I close this PR? Alternatively, I can modify this PR to only update the nonbreaking prefixes.