alvations issues

Results: 70 issues of alvations

Documentation

Now that we have more than tokenization, we need some proper documentation.

Moses' default behavior is to use a regex to deduplicate spaces (https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L507). But in Python, this can be done with `" ".join(text.split())`. Would it be appropriate to change the...
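To make the trade-off concrete, here is a minimal sketch comparing the two approaches. The `\s+` pattern is an assumption standing in for whatever the linked line actually uses, and both function names are hypothetical. The visible difference is that the split/join form also strips leading and trailing whitespace, while the regex form leaves a single space at each end.

```python
import re


def dedup_regex(text):
    # Assumed pattern: collapse every run of whitespace into one space.
    # This is a generic stand-in, not necessarily the rule at the linked line.
    return re.sub(r"\s+", " ", text)


def dedup_split(text):
    # str.split() with no argument splits on runs of whitespace and
    # discards empty leading/trailing fields, so the result is also stripped.
    return " ".join(text.split())


sample = "  Hello \t world\n"
regex_result = dedup_regex(sample)   # keeps one space at each end
split_result = dedup_split(sample)   # no space at either end
```

If the surrounding code already strips the text afterwards, the two forms should agree on inputs like this one; whether that holds for every rule in the tokenizer would need checking against its test suite.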

help wanted

question

Difference between regex and re library

Would `regex` be faster? Or more stable? Would changing from `re` to `regex` break the existing rules?
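One way to start answering the speed question is a small timing harness. The sketch below is only an illustration: the pattern and sample text are made up, `time_module` is a hypothetical helper, and the third-party `regex` package is treated as optional so the script still runs when it is not installed. It says nothing about stability or rule compatibility, which would need the existing tests.

```python
import re
import timeit

try:
    import regex  # third-party package, optional here
except ImportError:
    regex = None

PATTERN = r"\s+"            # illustrative pattern, not a sacremoses rule
TEXT = "a  b \t c\n" * 1000  # illustrative input


def time_module(module, number=20):
    # Compile once, then time repeated substitutions over the same text.
    compiled = module.compile(PATTERN)
    return timeit.timeit(lambda: compiled.sub(" ", TEXT), number=number)


timings = {"re": time_module(re)}
if regex is not None:
    timings["regex"] = time_module(regex)
    # Sanity check that both engines give the same output for this pattern.
    same_output = re.sub(PATTERN, " ", TEXT) == regex.sub(PATTERN, " ", TEXT)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```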

help wanted

question

Testing CLI commands

The CLI commands should be tested in the CI; cf. #36.
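A minimal sketch of how such a test could look, assuming the CLI is exercised through `subprocess`. The helper `run_cli` and the test name are hypothetical, and a Python one-liner is used as a stand-in command so the example runs without sacremoses installed; a real test would pass the actual command line instead.

```python
import subprocess
import sys


def run_cli(args, stdin_text=""):
    """Run a command, feed it stdin_text, and return (exit code, stdout)."""
    completed = subprocess.run(
        args,
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=60,
    )
    return completed.returncode, completed.stdout


# Stand-in command: upper-cases whatever arrives on stdin.
# In a real CI test this list would hold the CLI invocation under test.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def test_cli_round_trip():
    code, out = run_cli(STAND_IN, "hello world\n")
    assert code == 0
    assert out == "HELLO WORLD\n"
```

Passing the input through `input=` rather than a shell redirect keeps the test independent of the shell and makes the fixture text visible in the test itself.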

CLI not prompting when nothing is streaming into the command

While this works:

```
$ sacremoses train-truecase -m big.model -j 4 < big.txt.tok
```

It's sort of a gotcha when nothing is streaming in and the script just sits there,...
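One possible way to surface this, sketched under the assumption that the command reads from standard input: check whether stdin is attached to a terminal and print a hint to stderr before blocking. The function name and message below are hypothetical and not part of the existing CLI.

```python
import io
import sys

HINT = "Reading from stdin; pipe or redirect a file into this command."


def warn_if_interactive(stream=None, err=None):
    """Print a hint when the input stream is a terminal rather than a pipe or file."""
    stream = sys.stdin if stream is None else stream
    err = sys.stderr if err is None else err
    if stream.isatty():
        # Nothing is being piped in, so the user may not realise we are waiting.
        print(HINT, file=err)
        return True
    return False


# A piped or redirected input behaves like this in-memory stream: no hint is printed.
piped = io.StringIO("some tokenized text\n")
hinted = warn_if_interactive(piped, io.StringIO())
```

Writing the hint to stderr keeps stdout clean, so the output of a pipeline stays unchanged when the command is used as intended.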

Bug in final apostrophe!!

Bug in final apostrophe from original Moses!! **Original Moses**: ```shell $ cat in.txt dip dye hand-tufted ivory / navy area rug, 8' x 10' azzura hill hand-tufted ivory indoor/outdoor area...

Using as little CPU RAM and as few cores as possible when decoding

Is there an option to build the Marian binary such that it will use as little CPU RAM and as few cores as possible? It looks like it's using a...

Verbnet identifier in index.xml mismatch

When recompiling the `nltk_data`, it throws this error: ``` nltk_data$ make python tools/build_pkg_index.py . https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages index.xml Traceback (most recent call last): File "tools/build_pkg_index.py", line 24, in index = build_index(ROOT, BASE_URL)...

Alternate mirroring of nltk_data on Zenodo

One proposal to handle nltk/nltk#1787 is to find an alternative site that can act as a content distribution network and handle high-frequency requests appropriately. After some playing around, it's possible to mirror...

How to cite the Japanese word2vec model?

For the Japanese word2vec model, how should we reference it? In which year was it created?