sacremoses
Python port of Moses tokenizer, truecaser and normalizer
Bug in final apostrophe from original Moses!! **Original Moses**:
```shell
$ cat in.txt
dip dye hand-tufted ivory / navy area rug, 8' x 10'
azzura hill hand-tufted ivory indoor/outdoor area...
```
Is MosesPunctNormalizer.normalize() thread safe? It seems to be working but just wanted to double check.
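Moses-style punctuation normalization is a fixed list of regex substitutions, and precompiled CPython regex objects are safe to share across threads as long as no mutable state is shared between calls. A minimal concurrency smoke test of that pattern, using a simplified stand-in normalizer (not sacremoses itself) so the sketch stays self-contained:

```python
import concurrent.futures
import re

# Stand-in for MosesPunctNormalizer: a fixed, read-only list of
# (pattern, replacement) substitutions applied in order -- the same
# shape as Moses-style punctuation normalization, with no shared
# mutable state between calls.
SUBSTITUTIONS = [
    (re.compile(r"\s+"), " "),              # collapse runs of whitespace
    (re.compile(r"[\u201c\u201d]"), '"'),   # curly double quotes -> straight
    (re.compile(r"[\u2018\u2019]"), "'"),   # curly single quotes -> straight
]

def normalize(text):
    for pattern, repl in SUBSTITUTIONS:
        text = pattern.sub(repl, text)
    return text

def smoke_test(n_workers=8, n_calls=1000):
    """Run many concurrent normalize() calls and check they all agree."""
    text = "\u201chello\u201d   \u2018world\u2019"
    expected = normalize(text)
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda _: normalize(text), range(n_calls)))
    return all(r == expected for r in results)

if __name__ == "__main__":
    print(smoke_test())
```

A passing smoke test is not a proof of thread safety, but for a normalizer built purely from precompiled substitutions like this, there is no per-call state to race on.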
WIP in squeezing a bit more performance out of the detokenizer.
This PR copies from and replaces #114 which seems to be stale for more than 2 years, and also updates nonbreaking prefixes for Tetun Dili.
By default the library does not use protected patterns such as `WEB_PROTECTED_PATTERNS`, which contains, for example, URL and email patterns.
```python
# Example
tokenizer.tokenize("http://www.someurl.com")

# Expected output
["http://www.someurl.com"]

# sacremoses...
```
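For context, protected patterns in Moses-style tokenizers typically work by masking matches with placeholders before tokenization and restoring them afterwards. A self-contained sketch of that mechanism (the `naive_tokenize` helper and the placeholder format here are illustrative, not sacremoses' actual internals):

```python
import re

# Hypothetical equivalent of WEB_PROTECTED_PATTERNS: patterns whose
# matches must survive tokenization untouched.
PROTECTED_PATTERNS = [
    re.compile(r"\bhttps?://\S+", re.IGNORECASE),   # URLs
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),    # emails
]

def naive_tokenize(text):
    # Stand-in tokenizer that splits punctuation off words, roughly
    # what happens to a URL when nothing protects it.
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_with_protection(text, patterns=PROTECTED_PATTERNS):
    """Mask protected spans with placeholders, tokenize, then restore."""
    protected = []

    def mask(match):
        protected.append(match.group(0))
        return f"THISISPROTECTED{len(protected) - 1:03d}"

    for pattern in patterns:
        text = pattern.sub(mask, text)
    tokens = naive_tokenize(text)
    # Swap each placeholder back for the original protected span.
    return [
        protected[int(tok[-3:])] if tok.startswith("THISISPROTECTED") else tok
        for tok in tokens
    ]

print(tokenize_with_protection("See http://www.someurl.com now"))
# -> ['See', 'http://www.someurl.com', 'now']
```

Without the masking step, `naive_tokenize` would shred the URL into `['http', ':', '/', '/', 'www', ...]`, which is the behavior the issue is reporting.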
I have added two speed improvements:
- Compile regex patterns.
- Pre-define the character sets for islower() and isanyalpha().

Before:
```
Benchmark 1: python -m sacremoses -l en -j 1...
```
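Both techniques are general Python idioms rather than anything sacremoses-specific. A self-contained illustration (not the PR's actual diff) of what each one buys:

```python
import re
import timeit

TEXT = "The quick brown fox jumps over the lazy dog , again !" * 20

# Uncompiled: re.sub must look the pattern up in re's cache on every call.
def collapse_uncompiled(text):
    return re.sub(r"\s+", " ", text)

# Compiled once at import time: the per-call cache lookup disappears.
WHITESPACE = re.compile(r"\s+")
def collapse_compiled(text):
    return WHITESPACE.sub(" ", text)

# Pre-defined character set: a frozenset membership test as a fast,
# allocation-free alternative to regex-based character-class checks.
LOWER = frozenset("abcdefghijklmnopqrstuvwxyz")
def is_lower_word(word):
    return bool(word) and all(ch in LOWER for ch in word)

if __name__ == "__main__":
    t_uncompiled = timeit.timeit(lambda: collapse_uncompiled(TEXT), number=2000)
    t_compiled = timeit.timeit(lambda: collapse_compiled(TEXT), number=2000)
    print(f"uncompiled: {t_uncompiled:.3f}s  compiled: {t_compiled:.3f}s")
```

The gap per call is small, but a Moses tokenizer applies dozens of substitutions per sentence, so the savings compound over a corpus.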
```python3
from sacremoses import MosesTokenizer
print(MosesTokenizer(lang='en').penn_tokenize("-LRB- This is very nice -RRB-"))
```
I got the following error, and changing `lang='en'` to `lang='zh'` does not fix it.
```
Traceback...
```
The CLI flags and chaining through pipes should be tested with a little more robustness than just the examples in README.md. Not sure if it is still the case, but...
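One way to test pipe chaining more robustly is to drive each stage through `subprocess`, asserting on exit codes and intermediate output rather than eyeballing a shell pipeline. A sketch using trivial stdin-to-stdout stand-ins instead of the real sacremoses subcommands, so it runs anywhere:

```python
import subprocess
import sys

# Stand-ins for two CLI stages (e.g. tokenize | truecase): each reads
# stdin and writes stdout, so they chain the same way the real
# sacremoses pipeline does.
STAGE_UPPER = [sys.executable, "-c",
               "import sys; sys.stdout.write(sys.stdin.read().upper())"]
STAGE_EXCLAIM = [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.stdin.read().rstrip() + '!')"]

def run_pipeline(text, stages):
    """Feed text through a chain of stdin->stdout commands, checking
    every stage's exit code, and return the final output."""
    data = text.encode()
    for cmd in stages:
        proc = subprocess.run(cmd, input=data, capture_output=True)
        assert proc.returncode == 0, proc.stderr.decode()
        data = proc.stdout
    return data.decode()

def test_pipeline_roundtrip():
    assert run_pipeline("hello\n", [STAGE_UPPER, STAGE_EXCLAIM]) == "HELLO!"

if __name__ == "__main__":
    test_pipeline_roundtrip()
    print("ok")
```

Swapping the stand-ins for the actual `python -m sacremoses ...` invocations would turn this into an integration test of the flags and chaining the issue asks about.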