sacremoses

Python port of Moses tokenizer, truecaser and normalizer

Results 38 sacremoses issues

Bug in final apostrophe from original Moses!! **Original Moses**: ```shell $ cat in.txt dip dye hand-tufted ivory / navy area rug, 8' x 10' azzura hill hand-tufted ivory indoor/outdoor area...

Is MosesPunctNormalizer.normalize() thread safe? It seems to be working but just wanted to double check.
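One way to probe this empirically is to run the same inputs through a shared callable from several threads and compare the results against a sequential run. The sketch below does that. Matching output does not prove thread safety; it only means this run found no race. The `normalize` function is a hypothetical stand-in, not the sacremoses implementation; with sacremoses installed, the bound method `MosesPunctNormalizer().normalize` would presumably be passed in its place.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a shared normalizer (NOT the sacremoses code).
_MULTI_SPACE = re.compile(r"\s+")


def normalize(text):
    """Collapse whitespace runs and trim the ends."""
    return _MULTI_SPACE.sub(" ", text).strip()


def matches_sequential(func, inputs, workers=8):
    """Return True if concurrent calls yield the same results as sequential ones."""
    expected = [func(x) for x in inputs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the lists are directly comparable.
        actual = list(pool.map(func, inputs))
    return actual == expected


inputs = ["line   %d  with   extra   spaces " % i for i in range(1000)]
print(matches_sequential(normalize, inputs))
```

A check like this is most useful when repeated many times with varied inputs, since races tend to show up only intermittently.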

WIP in squeezing a bit more performance out of the detokenizer.

This PR copies from and replaces #114 which seems to be stale for more than 2 years, and also updates nonbreaking prefixes for Tetun Dili.

By default, the library does not use protected patterns such as `WEB_PROTECTED_PATTERNS`, which contains, for example, URL and email patterns. ```python # Example tokenizer.tokenize("http://www.someurl.com") # Expected output ["http://www.someurl.com"] # sacremoses...
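To illustrate what protecting a pattern means, here is a minimal standard-library sketch: spans that match a protected pattern are swapped for placeholder tokens before tokenization and restored afterwards. The `URL_PATTERN`, the placeholder format, and the naive tokenizer are all assumptions made for illustration; this is not the library's code and not its `WEB_PROTECTED_PATTERNS`.

```python
import re

# Hypothetical pattern for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")
_PLACEHOLDER = re.compile(r"THISISPROTECTED(\d{3})")


def tokenize_with_protection(text, protected=(URL_PATTERN,)):
    """Split punctuation from words while keeping protected spans intact."""
    saved = []

    def stash(match):
        # Swap the protected span for a purely alphanumeric placeholder.
        saved.append(match.group(0))
        return " THISISPROTECTED%03d " % (len(saved) - 1)

    for pattern in protected:
        text = pattern.sub(stash, text)

    # Naive tokenizer: pad every non-word, non-space character with spaces.
    text = re.sub(r"([^\w\s])", r" \1 ", text)

    # Restore the original spans in place of their placeholders.
    tokens = []
    for tok in text.split():
        m = _PLACEHOLDER.fullmatch(tok)
        tokens.append(saved[int(m.group(1))] if m else tok)
    return tokens


print(tokenize_with_protection("Visit http://www.someurl.com now!"))
# ['Visit', 'http://www.someurl.com', 'now', '!']
```

Without the protection step, the same naive tokenizer would split the URL at every `:`, `/` and `.`.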

I have added two speed improvements: - Compile regex patterns. - Pre-define the character sets for islower() and isanyalpha(). Before: ``` Benchmark 1: python -m sacremoses -l en -j 1...
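The two ideas can be sketched generically as follows. This is not the PR's code: the pattern, the ASCII-only `LOWER` set, and the function names are hypothetical. It shows a regex compiled once at import time rather than passed as a string on every call, and a pre-built character set used for membership tests.

```python
import re
import timeit

TEXT = "Hello , world ! This is a test . " * 50


def uncompiled(text):
    # The pattern string is handed to re on every call (cache lookup each time).
    return re.sub(r"\s+([,.!?])", r"\1", text)


# Compiled once at import time and reused.
_PUNCT_SPACE = re.compile(r"\s+([,.!?])")


def precompiled(text):
    return _PUNCT_SPACE.sub(r"\1", text)


# Hypothetical pre-defined character set (ASCII only, for illustration).
LOWER = frozenset("abcdefghijklmnopqrstuvwxyz")


def has_lower(text):
    """Return True if any character belongs to the pre-built set."""
    return any(ch in LOWER for ch in text)


if __name__ == "__main__":
    # Timings vary by machine, so no expected numbers are shown.
    print("uncompiled :", timeit.timeit(lambda: uncompiled(TEXT), number=2000))
    print("precompiled:", timeit.timeit(lambda: precompiled(TEXT), number=2000))
```

Because `re` caches recently used patterns internally, the gain from pre-compiling is often modest for a single pattern and tends to grow when many patterns are applied per line.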

```python3 from sacremoses import MosesTokenizer print(MosesTokenizer(lang='en').penn_tokenize("-LRB- This is very nice -RRB-")) ``` I got the following error. And I found changing `lang='en'` to `lang='zh'` doesn't solve the problem. ``` Traceback...

The CLI flags and chaining through pipelines should be tested a little more robustly than just the examples in README.md. Not sure if it is still the case, but...