alvations
The normalization bug in sacremoses happens here:
- https://github.com/alvations/sacremoses/blob/master/sacremoses/normalize.py#L41 and
- https://github.com/alvations/sacremoses/blob/master/sacremoses/normalize.py#L43
Thanks @j0hannes for catching this. #78 should fix it, but it should be rechecked against the Moses decoder repo too.
After the #78 fix, your cleaning workflow for your input would be something like:
1. First normalize your input
2. Then detokenize it (that's assuming you know that the original...
Yes, the example I gave is one of the typical pipelines that people use to clean data for machine translation. What's the expected output in your example? Do...
Ah, do you mean something like:

```python
>>> from sacremoses import MosesDetokenizer
>>> md = MosesDetokenizer(lang='en')
>>> text = "yesterday 's reception"
>>> md.detokenize(text.split())
"yesterday's reception"
```

But with the...
Actually, this part on adding a new apostrophe to the detokenization process isn't simple, https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L678 Because:
- There's some smart quote counting happening
- And the de-spacing of the apostrophe might be...
Maybe try https://sjmielke.com/papers/tokenize/ or spaCy for your use-case. I can take a look at this again without changing the detokenization behavior, but no promises, because supporting non-normalized text opens a...
After considering the different options, I think rolling click back to 7.0 would be better.
Which model? Do you mean the truecasing model? Other than that, there's no real model training in sacremoses; it's mostly writing and testing lots of regex rules =)
May I ask which preprocessing task you are referring to in sacremoses? The truecaser? For the other tasks, there's no training involved and the rules are manually defined 😅