Kenneth Benoit comments

Results: 258 comments of Kenneth Benoit

Add convert(x, to = "kerasR") functionality

No, because the dfm would erase the sequence of words. (Although removing them based on frequency destroys the original sequence too!) However, if you simply want them removed from the...

Add convert(x, to = "kerasR") functionality

Hi all, picking up this issue since we are working on a prototype in a new testing package [**quanteda.classifiers**](https://github.com/quanteda/quanteda.classifiers) that will make fitting and predicting work in the same way...

Remove wordstem_ngrams()

True, and once upon a time this is what we did, but then in response to #127 we added this. What harm does it do? More concerning to me is...

Remove wordstem_ngrams()

Right now, `wordstem_ngrams()` is purely internal, and is called when `tokens_wordstem()` is called on an ngram > 2 tokens, or when `dfm_wordstem()` is called on an ngram > 2...

Remove wordstem_ngrams()

I'm voting to close this - any reason why we would not want correct stemming of each element of an ngram?

Remove wordstem_ngrams()

But what if the tokens are inherited, say from another tokeniser, or compounded following some detection of collocations? It's more accurate to say that it's alternative functionality for most use...

Consider parallelizing tokenization

I saw a presentation on this remarkable package at the RStudio::conf in January, and was thinking exactly the same!

Consider parallelizing tokenization

I found the same using other parallelisation methods (from R): the vectorized `stri_split_boundaries()` was still fastest. Still, we batch the tokens now, so maybe that could be executed in parallel?

Consider parallelizing tokenization

I get, on macOS, running latest R 4.0:

```r
> plan(multiprocess)
Warning message:
[ONE-TIME WARNING] Forked processing ('multicore') is disabled in future (>= 1.13.0) when running R from RStudio, because...
```

Consider parallelizing tokenization

Outside of RStudio (plain R console):

```r
                 expr      min        lq     mean    median       uq      max neval cld
 tokens_parallel(txt)  7.83327  8.651916 11.62100  9.208451 11.29673 30.89798    10   a
          tokens(txt) 24.98809 25.313775 25.96159 25.797100...
```