alvations
Actually, if https://github.com/jakerylandwilliams/partitioner is already a working package in Python, there might not be a need to port/reimplement the code. Users can easily choose to use the tokenizer directly from partitioner....
Instead of using `pipenv run python ...`, you could try:

```
RUN python -c "import nltk; nltk.download('popular')"
```

Or:

```
RUN python -m nltk.downloader popular
```

The `popular` collection...
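The same downloader call also works for a single resource inside a Python build step; a minimal sketch (the resource name `punkt` is just an example):

```python
# Download one NLTK resource non-interactively, e.g. during an image build.
# 'punkt' is only an example; quiet=True suppresses the progress output.
import nltk

nltk.download("punkt", quiet=True)
```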
@fcbond any advice on this?
Quick hack, following #2154:

```python
>>> import nltk
>>> punkt = nltk.data.load('tokenizers/punkt/english.pickle')
>>> punkt._params.abbrev_types.add('al')
>>> text = 'If David et al. get the financing, we can move forward with the...
```
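Fleshed out into a self-contained snippet (the example sentence below is only illustrative):

```python
# Self-contained version of the hack above: teach the pickled Punkt model
# that 'al' (as in 'et al.') is an abbreviation, so it no longer splits there.
import nltk

punkt = nltk.data.load('tokenizers/punkt/english.pickle')
punkt._params.abbrev_types.add('al')

text = 'Smith et al. proposed a new tokenizer. It works well.'
print(punkt.tokenize(text))
# 'et al.' is no longer treated as a sentence boundary,
# so the first sentence stays intact.
```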
+1 @nschneid Most of Rebecca's work is in HPSG, which I would love to integrate into NLTK, but it's a tough nut. @goodmami, @fcbond and the DELPH-IN group have done...
@nschneid after some trawling through the REPP code, there are quite a lot of LISP rules written in separate files. Maybe the first thing we could try is to organize all...
@nschneid for now, the simplest solution seems to be wrapping REPP and reading its output files, like the other third-party tools in NLTK. It seems simple enough, and there are...
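A rough sketch of that wrapping pattern; the `repp` executable name, the `-c <config>` flag, and the output layout assumed here are all tentative, and the parsing is only schematic:

```python
# Sketch of wrapping the external REPP binary, in the same spirit as NLTK's
# other third-party tool wrappers: write the input to a temp file, call the
# binary, and read its output back.
# Assumptions: the executable is 'repp', it takes '-c <config>' plus an input
# file, and it prints one tokenized sentence per line, tokens space-separated.
import subprocess
import tempfile


def repp_tokenize(sentences, repp_bin='repp', config='erg/repp.set'):
    with tempfile.NamedTemporaryFile('w', delete=False) as fin:
        fin.write('\n'.join(sentences))
        input_path = fin.name
    proc = subprocess.run(
        [repp_bin, '-c', config, input_path],
        capture_output=True, text=True, check=True,
    )
    # Schematic parsing: one tokenized sentence per output line.
    return [line.split() for line in proc.stdout.splitlines() if line.strip()]
```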
+1 for TokenizeAnything. There's also https://github.com/jonsafari/tok-tok from @jonsafari.
I've written a small wrapper for REPP: https://github.com/alvations/nltk/blob/repp/nltk/tokenize/repp.py. Will do a PR once the `translate` modules are more stable.
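Usage would look roughly like this; the class name `ReppTokenizer` and the constructor argument (a path to a local REPP installation) are what the branch currently assumes, so treat the exact interface as tentative until the PR lands:

```python
# Tentative usage of the REPP wrapper linked above; the interface may change.
from nltk.tokenize.repp import ReppTokenizer

tokenizer = ReppTokenizer('/path/to/repp/')  # directory containing the repp binary
print(tokenizer.tokenize('A sample sentence to tokenize.'))
```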
Also ported `tok-tok.pl` to Python: https://github.com/alvations/nltk/blob/repp/nltk/tokenize/toktok.py.
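Assuming the port exposes a `ToktokTokenizer` class following the usual NLTK tokenizer interface (a `tokenize()` method), usage would be roughly:

```python
# Rough usage sketch for the tok-tok port; the example sentence is illustrative.
from nltk.tokenize.toktok import ToktokTokenizer

toktok = ToktokTokenizer()
print(toktok.tokenize('Is 9.5 or 525,600 my favorite number?'))
```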