sacremoses
Python port of Moses tokenizer, truecaser and normalizer
Bug in final apostrophe from original Moses!! **Original Moses**:
```shell
$ cat in.txt
dip dye hand-tufted ivory / navy area rug, 8' x 10'
azzura hill hand-tufted ivory indoor/outdoor area...
```
Is MosesPunctNormalizer.normalize() thread safe? It seems to be working but just wanted to double check.
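Moses-style punctuation normalization is a fixed list of regex substitutions, and precompiled CPython regex objects are safe to share across threads as long as no mutable state is shared between calls. A minimal concurrency smoke test of that pattern, using a simplified stand-in normalizer (not sacremoses itself) so the sketch stays self-contained:

```python
import concurrent.futures
import re

# Stand-in for MosesPunctNormalizer: a fixed, read-only list of
# (pattern, replacement) substitutions applied in order -- the same
# shape as Moses-style punctuation normalization, with no shared
# mutable state between calls.
SUBSTITUTIONS = [
    (re.compile(r"\s+"), " "),              # collapse runs of whitespace
    (re.compile(r"[\u201c\u201d]"), '"'),   # curly double quotes -> straight
    (re.compile(r"[\u2018\u2019]"), "'"),   # curly single quotes -> straight
]

def normalize(text):
    for pattern, repl in SUBSTITUTIONS:
        text = pattern.sub(repl, text)
    return text

def smoke_test(n_workers=8, n_calls=1000):
    """Run many concurrent normalize() calls and check they all agree."""
    text = "\u201chello\u201d   \u2018world\u2019"
    expected = normalize(text)
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda _: normalize(text), range(n_calls)))
    return all(r == expected for r in results)

if __name__ == "__main__":
    print(smoke_test())
```

A passing smoke test is not a proof of thread safety, but for a normalizer built purely from precompiled substitutions like this, there is no per-call state to race on.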
WIP in squeezing a bit more performance out of the detokenizer.
This PR copies from and replaces #114 which seems to be stale for more than 2 years, and also updates nonbreaking prefixes for Tetun Dili.
By default the library does not use protected patterns such as `WEB_PROTECTED_PATTERNS`, which contains, for example, URL and email patterns.
```python
# Example
tokenizer.tokenize("http://www.someurl.com")

# Expected output
["http://www.someurl.com"]

# sacremoses...
```
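For context, protected patterns in Moses-style tokenizers typically work by masking matches with placeholders before tokenization and restoring them afterwards. A self-contained sketch of that mechanism (the `naive_tokenize` helper and the placeholder format here are illustrative, not sacremoses' actual internals):

```python
import re

# Hypothetical equivalent of WEB_PROTECTED_PATTERNS: patterns whose
# matches must survive tokenization untouched.
PROTECTED_PATTERNS = [
    re.compile(r"\bhttps?://\S+", re.IGNORECASE),   # URLs
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),    # emails
]

def naive_tokenize(text):
    # Stand-in tokenizer that splits punctuation off words, roughly
    # what happens to a URL when nothing protects it.
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_with_protection(text, patterns=PROTECTED_PATTERNS):
    """Mask protected spans with placeholders, tokenize, then restore."""
    protected = []

    def mask(match):
        protected.append(match.group(0))
        return f"THISISPROTECTED{len(protected) - 1:03d}"

    for pattern in patterns:
        text = pattern.sub(mask, text)
    tokens = naive_tokenize(text)
    # Swap each placeholder back for the original protected span.
    return [
        protected[int(tok[-3:])] if tok.startswith("THISISPROTECTED") else tok
        for tok in tokens
    ]

print(tokenize_with_protection("See http://www.someurl.com now"))
# -> ['See', 'http://www.someurl.com', 'now']
```

Without the masking step, `naive_tokenize` would shred the URL into `['http', ':', '/', '/', 'www', ...]`, which is the behavior the issue is reporting.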
I have added two speed improvements:
- Compile regex patterns.
- Pre-define the character sets for islower() and isanyalpha().

Before:
```
Benchmark 1: python -m sacremoses -l en -j 1...
```
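Both techniques are general Python idioms rather than anything sacremoses-specific. A self-contained illustration (not the PR's actual diff) of what each one buys:

```python
import re
import timeit

TEXT = "The quick brown fox jumps over the lazy dog , again !" * 20

# Uncompiled: re.sub must look the pattern up in re's cache on every call.
def collapse_uncompiled(text):
    return re.sub(r"\s+", " ", text)

# Compiled once at import time: the per-call cache lookup disappears.
WHITESPACE = re.compile(r"\s+")
def collapse_compiled(text):
    return WHITESPACE.sub(" ", text)

# Pre-defined character set: a frozenset membership test as a fast,
# allocation-free alternative to regex-based character-class checks.
LOWER = frozenset("abcdefghijklmnopqrstuvwxyz")
def is_lower_word(word):
    return bool(word) and all(ch in LOWER for ch in word)

if __name__ == "__main__":
    t_uncompiled = timeit.timeit(lambda: collapse_uncompiled(TEXT), number=2000)
    t_compiled = timeit.timeit(lambda: collapse_compiled(TEXT), number=2000)
    print(f"uncompiled: {t_uncompiled:.3f}s  compiled: {t_compiled:.3f}s")
```

The gap per call is small, but a Moses tokenizer applies dozens of substitutions per sentence, so the savings compound over a corpus.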
```python3
from sacremoses import MosesTokenizer
print(MosesTokenizer(lang='en').penn_tokenize("-LRB- This is very nice -RRB-"))
```
I got the following error, and changing `lang='en'` to `lang='zh'` does not fix it.
```
Traceback...
```
The CLI flags and chaining through pipes should be tested with a little more robustness than just the examples in README.md. Not sure if it is still the case, but...
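One way to test pipe chaining more robustly is to drive each stage through `subprocess`, asserting on exit codes and intermediate output rather than eyeballing a shell pipeline. A sketch using trivial stdin-to-stdout stand-ins instead of the real sacremoses subcommands, so it runs anywhere:

```python
import subprocess
import sys

# Stand-ins for two CLI stages (e.g. tokenize | truecase): each reads
# stdin and writes stdout, so they chain the same way the real
# sacremoses pipeline does.
STAGE_UPPER = [sys.executable, "-c",
               "import sys; sys.stdout.write(sys.stdin.read().upper())"]
STAGE_EXCLAIM = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().rstrip() + '!')"]

def run_pipeline(text, stages):
    """Feed text through a chain of stdin->stdout commands, checking
    every stage's exit code, and return the final output."""
    data = text.encode()
    for cmd in stages:
        proc = subprocess.run(cmd, input=data, capture_output=True)
        assert proc.returncode == 0, proc.stderr.decode()
        data = proc.stdout
    return data.decode()

def test_pipeline_roundtrip():
    assert run_pipeline("hello\n", [STAGE_UPPER, STAGE_EXCLAIM]) == "HELLO!"

if __name__ == "__main__":
    test_pipeline_roundtrip()
    print("ok")
```

Swapping the stand-ins for the actual `python -m sacremoses ...` invocations would turn this into an integration test of the flags and chaining the issue asks about.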