Kenneth Benoit
No, because the dfm would erase the sequence of words. (Although removing them based on frequency destroys the original sequence too!) However, if you simply want them removed from the...
Hi all, picking up this issue since we are working on a prototype in a new testing package [**quanteda.classifiers**](https://github.com/quanteda/quanteda.classifiers) that will make fitting and predicting work in the same way...
True, and once upon a time this is what we did, but then in response to #127 we added this. What harm does it do? More concerning to me is...
Right now, `wordstem_ngrams()` is purely internal, and is called when `tokens_wordstem()` is called on an ngram > 2 tokens, or when `dfm_wordstem()` is called on an ngram > 2...
I'm voting to close this - any reason why we would not want correct stemming of each element of an ngram?
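To make "stemming of each element of an ngram" concrete, here is a minimal base-R sketch. It is not quanteda's internal `wordstem_ngrams()`. The helper name `stem_ngram_elements()`, the `"_"` concatenator, and the crude default suffix-stripper are all assumptions for illustration; a real stemmer such as `SnowballC::wordStem` could be passed as `stemmer` instead.

```r
# Hypothetical helper, NOT quanteda's internal implementation.
# Assumes ngram elements are joined by a single literal concatenator.
stem_ngram_elements <- function(x, concatenator = "_",
                                stemmer = function(w) sub("(ing|s)$", "", w)) {
  # The default stemmer is a toy suffix-stripper used only for illustration;
  # pass e.g. SnowballC::wordStem for real stemming.
  # Split each ngram into its component words
  parts <- strsplit(x, concatenator, fixed = TRUE)
  # Stem every component separately, then rejoin with the concatenator
  vapply(parts, function(p) paste(stemmer(p), collapse = concatenator),
         character(1))
}

ngrams <- c("running_dogs", "dogs_barking", "cats")
stemmed <- stem_ngram_elements(ngrams)
print(stemmed)
```

Stemming the whole string at once would only touch the final element (for example the trailing "s" in "running_dogs"), which is why the components are split out first.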
But what if the tokens are inherited, say from another tokeniser, or compounded following some detection of collocations? It's more accurate to say that it's alternative functionality for most use...
I saw a presentation on this remarkable package at the RStudio::conf in January, and was thinking exactly the same!
I found the same using other parallelisation methods (from R): the vectorized `stri_split_boundaries()` was still fastest. Still, we batch the tokens now; maybe that could be executed in parallel?
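As a rough sketch of what running batches in parallel could look like, the following uses only base R and the bundled `parallel` package. The helper `tokenize_batched()`, its arguments, and the whitespace `strsplit()` tokenizer are assumptions for illustration, not quanteda's batching code; a vectorized tokenizer such as `stringi::stri_split_boundaries(x, type = "word")` could be swapped in as `tokenizer`.

```r
library(parallel)

# Hypothetical helper: tokenize documents in batches, optionally in parallel.
tokenize_batched <- function(txt, batch_size = 2L, cores = 1L,
                             tokenizer = function(x) strsplit(x, "\\s+")) {
  # Assign each document to a batch, preserving document order
  batch_id <- ceiling(seq_along(txt) / batch_size)
  batches <- split(txt, batch_id)
  # mclapply forks on Unix-alikes; cores = 1 runs serially on any platform
  res <- mclapply(batches, tokenizer, mc.cores = cores)
  # Flatten the per-batch lists into one list with one element per document
  unlist(res, recursive = FALSE, use.names = FALSE)
}

txt <- c("a b c", "d e", "f", "g h i j", "k")
toks <- tokenize_batched(txt)
print(lengths(toks))
```

Whether this beats a single vectorized call depends on batch size and forking overhead, so it would need benchmarking before being adopted.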
I get, on macOS, running latest R 4.0:
```r
> plan(multiprocess)
Warning message:
[ONE-TIME WARNING] Forked processing ('multicore') is disabled in future (>= 1.13.0) when running R from RStudio, because...
```
Outside of RStudio (plain R console):
```r
                 expr      min        lq     mean    median       uq      max neval cld
 tokens_parallel(txt)  7.83327  8.651916 11.62100  9.208451 11.29673 30.89798    10   a
          tokens(txt) 24.98809 25.313775 25.96159 25.797100...
```