sacremoses
Python port of Moses tokenizer, truecaser and normalizer
As Python 2 support was officially dropped in 0.0.41 (cf. #94), this pull request cleans up all Python 2 compatibility code and gets rid of the `six` dependency to...
Compiles regexes where appropriate for improved performance on common operations (subs, searches, matches, finditers). Timeit results below for a microbenchmark (`MT1` is the original without compilation, `MT2` is the new version with compilation...
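The pattern being benchmarked can be sketched as follows. This is an illustrative, self-contained example, not sacremoses' actual code; the function and pattern names are hypothetical:

```python
import re
import timeit

TEXT = "Hello , world !  This is  a test ." * 20

# Uncompiled: the pattern string goes through re's internal cache lookup
# on every call.
def collapse_spaces_uncompiled(text):
    return re.sub(r"\s+", " ", text)

# Compiled once at import time and reused, skipping the cache lookup.
DEDUPLICATE_SPACE = re.compile(r"\s+")

def collapse_spaces_compiled(text):
    return DEDUPLICATE_SPACE.sub(" ", text)

# Both produce identical output; only the per-call overhead differs.
assert collapse_spaces_uncompiled(TEXT) == collapse_spaces_compiled(TEXT)

t1 = timeit.timeit(lambda: collapse_spaces_uncompiled(TEXT), number=2_000)
t2 = timeit.timeit(lambda: collapse_spaces_compiled(TEXT), number=2_000)
print(f"uncompiled: {t1:.3f}s  compiled: {t2:.3f}s")
```

The gain per call is small, but the normalizer and tokenizer apply dozens of substitutions per line, so it compounds over large corpora.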
## Description

Add tokenization support for the Tetun language (tdt). Equivalent of https://github.com/moses-smt/mosesdecoder/pull/224. Tetun has words that contain apostrophes (e.g., "me" in Tetun is "ha'u"). The logic here will keep...
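The core idea can be sketched with a standalone regex tokenizer (a hedged illustration, not the actual patch): an apostrophe flanked by letters is treated as word-internal rather than split off.

```python
import re

# A word is a run of letters optionally joined by word-internal
# apostrophes (as in Tetun "ha'u"); any other non-space character
# becomes its own token.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|[^\sA-Za-z]")

def tokenize(text):
    return TOKEN.findall(text)

# Illustrative input; the apostrophe in "ha'u" stays attached.
print(tokenize("ha'u."))  # -> ["ha'u", '.']
```

The real tokenizer works with protected patterns and language-specific apostrophe rules rather than a single regex, but the splitting decision is the same.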
Many projects use versions of `click` other than `8.0`, and we should allow that. Ideally, we should also figure out what caused `8.1.3` to break. Also, apparently `8.0.0` is broken.
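One hedged way to express that in the requirements (the exact bounds are a guess pending investigation of the `8.1.3` breakage):

```
# requirements sketch: allow click broadly, excluding known-bad releases
click>=7.0,!=8.0.0,!=8.1.3
```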
The old version causes a `TypeError` (`'int' + 'str'`) and is not what mosesdecoder actually does.
```
(掌声)这个是,盛装舞步。
```

The result generated by the command `sacremoses -l zh -j 4 tokenize < input > output` is

```
( 掌声 ) 这个是 , 盛装舞步。
```

I think it...
Just wanted to leave a reference here to https://huggingface.co/dsilin/detok-deberta-xl. It's a deep detokenizer trained to reverse sacremoses (with HTML unescaped). If enough people find it useful, maybe a footnote in the...
```
text = "will not be the true meaning. always remember that our mind"
print(moses_tokenizer.tokenize(text, escape=False))
```

I get the following output:

```
['will', 'not', 'be', 'the', 'true', 'meaning.', 'always', 'remember', 'that', 'our',...
```
Hi there, Moses has a `detokenize_penn()` method in Perl, but I can't find it here. This means that tokens like `wo n't` or `ca n't` can't be detokenised. Any chance...
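As a stopgap, the Penn-style contraction splits can be undone with a small regex pass. A hedged sketch (not sacremoses' API, and far less complete than the Perl `detokenize_penn()`):

```python
import re

# Penn Treebank tokenization splits contractions into "wo n't",
# "ca n't", "they 'll", "I 'm", etc. Rejoin the clitic to the
# preceding word by deleting the space before it.
CLITICS = re.compile(r" (n't|'s|'m|'re|'ve|'ll|'d)\b", flags=re.IGNORECASE)

def detokenize_penn_contractions(text):
    return CLITICS.sub(r"\1", text)

print(detokenize_penn_contractions("I ca n't believe they 'll go"))
# -> "I can't believe they'll go"
```

Note this only reattaches the clitic; it does not restore "wo"/"ca" to "will"/"can", which is not needed since "won't" and "can't" are the original surface forms.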