alvations comments

Results: 155 comments of alvations

sentence piece cannot get bleu-detok score when training.

Heads up on JESC, it's pretty noisy =)

Extra blank space at the end of nonbreaking_prefix.en at end of line 103

@advpetc Thank you for reporting the issue. Could you explain it with a little more detail? It's a little unclear what changes you are proposing. - What is the input...

Extra blank space at the end of nonbreaking_prefix.en at end of line 103

Yes, the `MosesTokenizer` output in NLTK doesn't correspond to the one from Moses; the NLTK output shouldn't be the expected behavior: ``` ~/mosesdecoder/scripts/tokenizer$ perl tokenizer.perl -l en Tokenizer Version 1.1...

Extra blank space at the end of nonbreaking_prefix.en at end of line 103

Yes, removing the extra space in nonbreaking prefix for the `No #NUMERIC_ONLY#` line solves the problem. After removing the extra space: ``` >>> from nltk.tokenize.moses import MosesTokenizer moses =>>> moses...
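
To illustrate why a single trailing space can matter, here is a minimal sketch. It assumes a hypothetical loader that separates a prefix from its `#NUMERIC_ONLY#` marker by splitting on the last space; `parse_prefix_line` is an illustrative name, not the actual NLTK or Moses code.

```python
NUMERIC_ONLY = "#NUMERIC_ONLY#"


def parse_prefix_line(line):
    """Hypothetical parser: return (prefix, is_numeric_only) for one entry.

    Assumes the marker is separated from the prefix by the last space
    in the line, so trailing whitespace shifts the split point.
    """
    if NUMERIC_ONLY in line:
        # Everything before the last space is taken as the prefix.
        return line.rpartition(" ")[0], True
    return line, False


# With the trailing space, the marker ends up inside the "prefix".
broken = parse_prefix_line("No #NUMERIC_ONLY# ")

# Stripping whitespace first restores the intended split.
fixed = parse_prefix_line("No #NUMERIC_ONLY# ".strip())

print(broken)
print(fixed)
```

Stripping each line before parsing is one defensive option; removing the stray space from the data file, as described above, fixes it at the source.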

Extra blank space at the end of nonbreaking_prefix.en at end of line 103

A regression test of the Moses vs. NLTK implementations would be good to catch all these kinks =) I've just checked the `nonbreaking_prefixes.en` from https://github.com/alvations/mosesdecoder/blob/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en#L103 and there's a space there too. So...
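
A regression check along those lines could be sketched as below. This is only an outline under assumptions: `find_mismatches` is a hypothetical helper, and the two tokenizers are passed in as plain callables (in practice one might wrap the Perl script and the other the NLTK class). The toy tokenizers here are stand-ins, not the real implementations.

```python
def find_mismatches(sentences, reference_tokenize, candidate_tokenize):
    """Return (sentence, expected, actual) for every sentence where
    the candidate tokenizer disagrees with the reference tokenizer."""
    mismatches = []
    for sentence in sentences:
        expected = reference_tokenize(sentence)
        actual = candidate_tokenize(sentence)
        if expected != actual:
            mismatches.append((sentence, expected, actual))
    return mismatches


# Toy stand-ins for illustration only.
def reference(text):
    # Splits a final period off the last word.
    return text.replace(".", " .").split()


def candidate(text):
    # Plain whitespace split, leaves the period attached.
    return text.split()


sentences = ["hello world", "hello world."]
report = find_mismatches(sentences, reference, candidate)
for sentence, expected, actual in report:
    print(sentence, expected, actual)
```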

wrong german stopwords in stopwords corpora

The "non-words" raised by @juh2 should have been resolved in #49 ```python >>> from nltk.corpus import stopwords >>> deu_stops = stopwords.words('german') >>> 'unse' in deu_stops False >>> 'unsem' in deu_stops...

framenet_v15 appears to be corrupt

Sorry for missing this issue. There shouldn't be any problem with the framenet_v15.zip now in the latest version of nltk and nltk_data. @slremy Are you still having an issue...

framenet_v15 appears to be corrupt

The UI is a little buggy. Try using `nltk.download('framenet_v15')`

Verbnet identifier in index.xml mismatch

This is because both `verbnet` and `verbnet3` have the same `id`: ``` nltk_data/packages/corpora$ cat verbnet.xml nltk_data/packages/corpora$ cat verbnet3.xml ``` The same identifier is causing the mismatch in the `nltk` code...
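
A quick way to spot this kind of clash is to group package files by identifier. The sketch below assumes each package file is a small XML document whose root element carries an `id` attribute; the sample XML strings and the `find_duplicate_ids` helper are illustrative, not copied from `nltk_data`.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def find_duplicate_ids(xml_by_filename):
    """Map each package id used by more than one file to those filenames."""
    files_by_id = defaultdict(list)
    for filename, xml_text in xml_by_filename.items():
        root = ET.fromstring(xml_text)
        files_by_id[root.get("id")].append(filename)
    return {
        package_id: sorted(filenames)
        for package_id, filenames in files_by_id.items()
        if len(filenames) > 1
    }


# Illustrative contents only; the real files carry more attributes.
packages = {
    "verbnet.xml": '<package id="verbnet" name="VerbNet" />',
    "verbnet3.xml": '<package id="verbnet" name="VerbNet 3" />',
    "framenet_v15.xml": '<package id="framenet_v15" name="FrameNet 1.5" />',
}

duplicates = find_duplicate_ids(packages)
print(duplicates)
```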

License

The different resources in `nltk_data` come under different licenses. The licenses of the individual resources in `nltk_data` should be safe for re-distribution. It'd be great to package `nltk_data`, would it...