sacremoses
Python port of Moses tokenizer, truecaser and normalizer
As Python 2 support was officially dropped in 0.0.41 (cf. #94), this pull request cleans up all Python 2 compatibility code and gets rid of the `six` dependency to...
Compiles regexes where appropriate for improved performance on common operations (subs, searches, matches, finditers). Timeit results below for a microbenchmark (`MT1` is the original without compilation, `MT2` is the new version with compilation...
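The pattern being benchmarked can be sketched as follows. This is an illustrative, self-contained example, not sacremoses' actual code; the function and pattern names are hypothetical:

```python
import re
import timeit

TEXT = "Hello , world !  This is  a test ." * 20

# Uncompiled: the pattern string goes through re's internal cache lookup
# on every call.
def collapse_spaces_uncompiled(text):
    return re.sub(r"\s+", " ", text)

# Compiled once at import time and reused, skipping the cache lookup.
DEDUPLICATE_SPACE = re.compile(r"\s+")

def collapse_spaces_compiled(text):
    return DEDUPLICATE_SPACE.sub(" ", text)

# Both produce identical output; only the per-call overhead differs.
assert collapse_spaces_uncompiled(TEXT) == collapse_spaces_compiled(TEXT)

t1 = timeit.timeit(lambda: collapse_spaces_uncompiled(TEXT), number=2_000)
t2 = timeit.timeit(lambda: collapse_spaces_compiled(TEXT), number=2_000)
print(f"uncompiled: {t1:.3f}s  compiled: {t2:.3f}s")
```

The gain per call is small, but the normalizer and tokenizer apply dozens of substitutions per line, so it compounds over large corpora.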
## Description

Add tokenization support for the Tetun language (tdt). Equivalent of https://github.com/moses-smt/mosesdecoder/pull/224. Tetun has words that contain apostrophes (e.g., "me" in Tetun is "ha'u"). The logic here will keep...
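The core idea can be sketched with a standalone regex tokenizer (a hedged illustration, not the actual patch): an apostrophe flanked by letters is treated as word-internal rather than split off.

```python
import re

# A word is a run of letters optionally joined by word-internal
# apostrophes (as in Tetun "ha'u"); any other non-space character
# becomes its own token.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|[^\sA-Za-z]")

def tokenize(text):
    return TOKEN.findall(text)

# Illustrative input; the apostrophe in "ha'u" stays attached.
print(tokenize("ha'u."))  # -> ["ha'u", '.']
```

The real tokenizer works with protected patterns and language-specific apostrophe rules rather than a single regex, but the splitting decision is the same.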
Many projects use versions of `click` other than `8.0`, and we should allow that. Ideally, we should also figure out what caused `8.1.3` to break. Also, apparently `8.0.0` is broken.
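One hedged way to express that in the requirements (the exact bounds are a guess pending investigation of the `8.1.3` breakage):

```
# requirements sketch: allow click broadly, excluding known-bad releases
click>=7.0,!=8.0.0,!=8.1.3
```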
The old version causes a `TypeError` (`'int' + 'str'`) and is not what mosesdecoder actually does.
```
(掌声)这个是,盛装舞步。
```

The result generated by the command `sacremoses -l zh -j 4 tokenize < input > output` is

```
( 掌声 ) 这个是 , 盛装舞步。
```

I think it...
Just wanted to leave a reference here to https://huggingface.co/dsilin/detok-deberta-xl. It's a deep detokenizer trained to reverse sacremoses (with HTML unescaped). If enough people find it useful, maybe a footnote in the...
```
text = "will not be the true meaning. always remember that our mind"
print(moses_tokenizer.tokenize(text, escape=False))
```

I get the following output:

```
['will', 'not', 'be', 'the', 'true', 'meaning.', 'always', 'remember', 'that', 'our',...
```
Hi there, Moses has a `detokenize_penn()` method in Perl, but I can't find it here. This means that tokens like `wo n't` or `ca n't` can't be detokenised. Any chance...
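As a stopgap, the Penn-style contraction splits can be undone with a small regex pass. A hedged sketch (not sacremoses' API, and far less complete than the Perl `detokenize_penn()`):

```python
import re

# Penn Treebank tokenization splits contractions into "wo n't",
# "ca n't", "they 'll", "I 'm", etc. Rejoin the clitic to the
# preceding word by deleting the space before it.
CLITICS = re.compile(r" (n't|'s|'m|'re|'ve|'ll|'d)\b", flags=re.IGNORECASE)

def detokenize_penn_contractions(text):
    return CLITICS.sub(r"\1", text)

print(detokenize_penn_contractions("I ca n't believe they 'll go"))
# -> "I can't believe they'll go"
```

Note this only reattaches the clitic; it does not restore "wo"/"ca" to "will"/"can", which is not needed since "won't" and "can't" are the original surface forms.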