sacremoses

Python port of Moses tokenizer, truecaser and normalizer

Results 38 sacremoses issues

As Python 2 support has been officially dropped since 0.0.41 (cf. #94), this pull request cleans up all Python 2 compatibility code and removes the `six` dependency to...

Compiles regexes where appropriate to improve the performance of common operations (subs, searches, matches, finditers). Timeit info below for a microbenchmark (`MT1` is the original without compilation, `MT2` is the new version with compilation...
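As a rough illustration of the kind of change described, the sketch below compares calling `re.sub` with a pattern string against calling `.sub` on a precompiled pattern. The pattern, function names, and sample text are hypothetical and are not taken from the pull request. Because Python's `re` module caches compiled patterns internally, any gain measured this way comes mostly from skipping the cache lookup on each call.

```python
import re
import timeit

# Hypothetical pattern for illustration only; not one from sacremoses.
PATTERN = r"\s+"
COMPILED = re.compile(PATTERN)


def collapse_uncompiled(text):
    # Passes the pattern string on every call (re looks it up in its cache).
    return re.sub(PATTERN, " ", text)


def collapse_compiled(text):
    # Reuses the pattern object compiled once at import time.
    return COMPILED.sub(" ", text)


sample = "Hello   world,\tthis  is   a    test."

# Both variants must produce the same result.
assert collapse_uncompiled(sample) == collapse_compiled(sample)

# Small microbenchmark; absolute numbers depend on the machine.
for fn in (collapse_uncompiled, collapse_compiled):
    seconds = timeit.timeit(lambda: fn(sample), number=10_000)
    print(f"{fn.__name__}: {seconds:.4f}s for 10,000 calls")
```

The timings vary by machine and Python version, so they are best read as a relative comparison only.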

## Description
Add tokenization support for the Tetun language (tdt). Equivalent of https://github.com/moses-smt/mosesdecoder/pull/224
Tetun has words that contain apostrophes (e.g. "me" in Tetun is "ha'u"). The logic here will keep...
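To make the apostrophe problem concrete, here is a minimal sketch of a regex tokenizer that keeps word-internal apostrophes attached while still splitting off other punctuation. It is not the logic from the pull request or from mosesdecoder; the pattern and the function name `tokenize_keep_apostrophes` are assumptions made for illustration.

```python
import re

# Hypothetical pattern: a word may contain apostrophes only between
# word characters; any other non-space symbol becomes its own token.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize_keep_apostrophes(text):
    """Split text into tokens, keeping word-internal apostrophes."""
    return TOKEN_RE.findall(text)


print(tokenize_keep_apostrophes("Ha'u ba uma."))
# → ["Ha'u", 'ba', 'uma', '.']
```

An apostrophe that is not surrounded by word characters, such as a quotation mark around a word, is still emitted as a separate token under this pattern.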

Many projects use versions of `click` that aren't `8.0`, and we should allow that. Ideally, we should also figure out what caused `8.1.3` to break. Also, apparently `8.0.0` is broken.

The old version causes an 'int' + 'str' error and is not what mosesdecoder actually does.

```
(掌声)这个是,盛装舞步。
```
The result generated by the command `sacremoses -l zh -j 4 tokenize < input > output` is
```
( 掌声 ) 这个是 , 盛装舞步。
```
I think it...

Just wanted to have a reference here to https://huggingface.co/dsilin/detok-deberta-xl
It's a deep detokenizer trained to reverse sacremoses (HTML unescaped). If enough people find it useful maybe a footnote in the...

text = "will not be the true meaning. always remember that our mind"
print(moses_tokenizer.tokenize(text, escape=False))
I get the following output:
['will', 'not', 'be', 'the', 'true', 'meaning.', 'always', 'remember', 'that', 'our',...

Hi there, Moses has a `detokenize_penn()` method in Perl but I can't find it here. This means that things like `wo n't` or `ca n't` can't be detokenised. Any chance...
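Until such a method exists, one possible workaround is a small post-processing step that rejoins Penn-style `n't` splits. The sketch below is an assumption-laden illustration: the helper name `rejoin_nt` is hypothetical, it handles only the `n't` clitic, and it does not reproduce the Perl `detokenize_penn()` logic.

```python
import re

# Matches a word character, a single space, then the clitic "n't".
NT_RE = re.compile(r"(\w) (n't)\b")


def rejoin_nt(text):
    """Rejoin Penn Treebank style splits such as "wo n't" into "won't"."""
    return NT_RE.sub(r"\1\2", text)


print(rejoin_nt("I wo n't go and I ca n't stay ."))
# → I won't go and I can't stay .
```

Other Penn Treebank clitics (for example `'s` or `'re`) would need their own rules, and the remaining punctuation spacing is left untouched here.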