
153 comments by alvations

@goodmami I'll put the information up on https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software once I get some time to do a PR. I wrote the wrapper while stuck on a train without wifi...

If anyone else is interested in reimplementing/wrapping other tokenizers/stemmers/lemmatizers, I would suggest the following tools as a `good-first-contribution` to NLTK:

- [Moses tokenizer + detokenizer](https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer)
- ...

Now that the Moses tokenizer and detokenizer are working (#1551, #1553), would any brave soul like to try reimplementing [Elephant](http://gmb.let.rug.nl/elephant/about.php) with `sklearn`?
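For anyone tempted: Elephant casts tokenization as character-level sequence labelling, so a `sklearn` take on it could start from per-character boundary classification. The sketch below is only a toy illustration under that assumption; the feature set, function names, and training data are all hypothetical, and the real Elephant trains CRF/RNN models on proper treebank data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def char_features(text, i):
    """Context-window features for the character at position i."""
    return {
        "char": text[i],
        "prev": text[i - 1] if i > 0 else "<s>",
        "next": text[i + 1] if i < len(text) - 1 else "</s>",
        "is_space": text[i].isspace(),
        "is_punct": not text[i].isalnum() and not text[i].isspace(),
    }

def boundary_labels(text, tokens):
    """Label each character 'B' if it begins a gold token, else 'I'."""
    starts, pos = set(), 0
    for tok in tokens:
        start = text.index(tok, pos)
        starts.add(start)
        pos = start + len(tok)
    return ["B" if i in starts else "I" for i in range(len(text))]

# Toy training data; a real reimplementation would train on a treebank.
train_text = "Hello, world! It's fine."
train_tokens = ["Hello", ",", "world", "!", "It", "'s", "fine", "."]

X = [char_features(train_text, i) for i in range(len(train_text))]
y = boundary_labels(train_text, train_tokens)

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

def tokenize(s):
    """Cut before every predicted 'B'; whitespace never enters a token."""
    preds = model.predict([char_features(s, i) for i in range(len(s))])
    out, cur = [], ""
    for ch, p in zip(s, preds):
        if p == "B" and cur:
            out.append(cur)
            cur = ""
        if not ch.isspace():
            cur += ch
    if cur:
        out.append(cur)
    return out

print(tokenize("Hello, world!"))
```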

From #1860, it looks like the TreebankWordTokenizer that we're using as the default `word_tokenize()` is rather outdated, and URLs and dates aren't really handled. Taking a closer look at...
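To see the URL problem concretely (the exact output depends on the NLTK version, so treat this as an illustration):

```python
from nltk import word_tokenize

print(word_tokenize("The demo is at https://example.com today."))
# The Treebank-style rules typically break the URL at the colon, e.g.
# ['The', 'demo', 'is', 'at', 'https', ':', '//example.com', 'today', '.']
```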

Personally, I would advise against using the GUI for the NLTK downloader. But to explain the problem: it's most probably because of `tkinter`. On Mac OS, maybe this works, assuming that...
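For what it's worth, the downloader can be driven without the GUI at all, either from Python or from the shell:

```python
import nltk

nltk.download('punkt')  # fetches a single package, no Tkinter window involved

# Equivalently, from the shell:
#   python -m nltk.downloader punkt
#   python -m nltk.downloader all
```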

In `model.npz.decoder.yml`:

```
models:
  - model.npz
vocabs:
  - vocab.src.spm
  - vocab.trg.spm
beam-size: 6
normalize: 0.6
word-penalty: 0
mini-batch: 16
maxi-batch: 100
maxi-batch-sort: src
relative-paths: false
```

On CLI, with GPU...

`--workspace` limits the GPU RAM (sort of). But does `--workspace` work for CPU RAM too?

Just found out about @mjpost and Sockeye's https://awslabs.github.io/sockeye/tutorials/adapt.html and the "continue training" routine from https://arxiv.org/pdf/1612.06897v1.pdf. Sometimes we kind of do something like that by writing a bash script and substituting the training...

Like this, in its rawest form: https://gist.github.com/alvations/3d8db26ea5e8f803df6730b652223e0e But the Opus-MT one looks much more stable with proper version checkouts.

See https://stackoverflow.com/questions/44449284/nltk-words-corpus-does-not-contain-okay