alvations
Now that we have more than tokenization, we need some proper documentation.
Moses' default behavior is to use a regex to deduplicate spaces: https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L507. But in Python, this can be done with `" ".join(text.split())`. Would it be appropriate to change the...
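To make the comparison concrete, here is a minimal sketch of the two approaches. It assumes the regex in question is equivalent to replacing `\s+` with a single space (an assumption, not verified against the linked line); the function names `dedup_regex` and `dedup_split` are hypothetical.

```python
import re


def dedup_regex(text):
    # Assumed equivalent of the Moses-style rule: collapse whitespace runs to one space.
    return re.sub(r"\s+", " ", text)


def dedup_split(text):
    # Pure-Python alternative: split on whitespace runs, rejoin with single spaces.
    return " ".join(text.split())


sample = "  hello \t world\n\nfoo  "
regex_result = dedup_regex(sample)
split_result = dedup_split(sample)
```

One difference worth noting: the `split`/`join` form also drops leading and trailing whitespace, while the regex form leaves a single space at each end, so the two only agree if the result is stripped afterwards.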
Would `regex` be faster? Or more stable? Would changing from `re` to `regex` break the existing rules?
The CLI commands should be tested in the CI, cf. #36
While this works: ``` $ sacremoses train-truecase -m big.model -j 4 < big.txt.tok ``` It's sort of a gotcha when nothing is streaming in and the script just sits there,...
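One possible way to soften this gotcha, sketched under the assumption that the command reads from standard input: check whether stdin is attached to a terminal and print a hint before blocking. The helper name `warn_if_no_pipe` and the message text are hypothetical and not part of sacremoses.

```python
import io
import sys


def warn_if_no_pipe(stream=None, err=None):
    """Return True (and print a hint) if the input stream is an interactive terminal."""
    stream = sys.stdin if stream is None else stream
    err = sys.stderr if err is None else err
    if stream.isatty():
        # Nothing is piped or redirected in; the command would wait for typed input.
        print("Waiting for input on stdin; pipe or redirect a file.", file=err)
        return True
    return False


class FakeTTY(io.StringIO):
    # Hypothetical stand-in for a terminal, used only to exercise the check.
    def isatty(self):
        return True


piped = warn_if_no_pipe(io.StringIO("some text"), err=io.StringIO())
interactive = warn_if_no_pipe(FakeTTY(), err=io.StringIO())
```

A check like this does not change behavior when data is piped in; it only adds a message to stderr in the interactive case.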
Bug in final apostrophe from original Moses!! **Original Moses**: ```shell $ cat in.txt dip dye hand-tufted ivory / navy area rug, 8' x 10' azzura hill hand-tufted ivory indoor/outdoor area...
Is there an option to build the Marian binary such that it uses as little CPU RAM and as few cores as possible? It looks like it's using a...
When recompiling the `nltk_data`, it throws this error: ``` nltk_data$ make python tools/build_pkg_index.py . https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages index.xml Traceback (most recent call last): File "tools/build_pkg_index.py", line 24, in index = build_index(ROOT, BASE_URL)...
One proposal to handle nltk/nltk#1787 is to find an alternative site that can serve as a content distribution network and handle high-frequency requests appropriately. After some playing around, it's possible to mirror...
For the Japanese word2vec model, how should we reference it? In which year was it created?