Add tokenization for Tetun Dili (tdt)

Open BLKSerene opened this issue 1 year ago • 4 comments

This PR copies from and replaces #114, which seems to have been stale for more than 2 years, and also updates the nonbreaking prefixes for Tetun Dili.

BLKSerene avatar Sep 25 '23 03:09 BLKSerene

Hi, thanks for this addition!

Do you have some example sentences that trigger the added regular expressions and (ideally) some of the non-breaking prefixes unique to this language? In the future, I'd like to add tests for all supported languages so we can make sure we don't break/change anything by accident.
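A regression test of the kind described here could look roughly like the sketch below. It is a simplified, standard-library-only illustration of how a non-breaking prefix list changes the handling of a trailing period, not the actual sacremoses or Moses logic. The prefix set, the `tokenize` helper, and the sample sentence are all hypothetical placeholders and have not been checked against real Tetun Dili data.

```python
import re

# Hypothetical prefixes for illustration only; NOT the real Tetun Dili list.
NONBREAKING_PREFIXES = {"Sr", "Sra", "Dr"}


def split_trailing_period(token, is_last):
    """Split off a final period unless the token is a known non-breaking prefix."""
    match = re.fullmatch(r"(.+)\.", token)
    if not match:
        return [token]
    stem = match.group(1)
    # A listed prefix keeps its period, except at the very end of the sentence.
    if stem in NONBREAKING_PREFIXES and not is_last:
        return [token]
    return [stem, "."]


def tokenize(text):
    """Whitespace-split the text, then apply the trailing-period rule per token."""
    words = text.split()
    tokens = []
    for i, word in enumerate(words):
        tokens.extend(split_trailing_period(word, is_last=(i == len(words) - 1)))
    return tokens


# Placeholder sentence; a real test would use verified Tetun Dili text.
print(tokenize("Sr. Silva mai ona."))
```

A real test would pair each sentence with the output expected from the reference tokenizer, so that any change in behaviour shows up as a failing assertion.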

jelmervdl avatar Sep 25 '23 14:09 jelmervdl

I don't speak Tetun Dili, so I hope that these tests work as expected...

BLKSerene avatar Sep 25 '23 19:09 BLKSerene

I noticed there's a test sentence in the original mosesdecoder pull request, but when I try it, the Perl and Python implementations yield different output. The original pull request (and what's currently in the Moses tokenizer) is also different.

I'll dig a bit deeper to see whether I can find out why #114 decided to implement it differently. I'm tempted to stick to what's in the old Moses repo unless there's a very good reason not to.

jelmervdl avatar Sep 27 '23 13:09 jelmervdl

Hi, any updates on this? Shall I close this PR? Alternatively, I can modify this PR to only update the nonbreaking prefixes.

BLKSerene avatar Apr 24 '24 07:04 BLKSerene