Kohei Watanabe
I agree that there are such tokens in foreign objects, so it is fine that such a function exists, but not in the core quanteda. The best place for these...
I experimented with parallelization in different ways. My initial idea was to call `stri_split_boundaries()` in parallel, but it was slower, probably because of the large object size (a list of character vectors)....
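A minimal sketch of that first experiment (the chunking scheme and the `mclapply()` call here are assumptions, not the exact code that was benchmarked):

```r
require(stringi)
require(parallel)

txt <- rep("This is an example sentence for boundary splitting.", 10000)

# serial: one call over the whole character vector
toks_serial <- stri_split_boundaries(txt, type = "word", skip_word_none = TRUE)

# parallel: tokenize chunks of texts on separate workers; each worker returns
# a large list of character vectors that must be copied back to the master,
# which is likely where the overhead comes from
chunks <- split(txt, cut(seq_along(txt), 4, labels = FALSE))
toks_parallel <- unlist(
    mclapply(chunks, stri_split_boundaries, type = "word",
             skip_word_none = TRUE, mc.cores = 4),
    recursive = FALSE, use.names = FALSE
)
```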
Yes, this part will be the target of parallelization: https://github.com/quanteda/quanteda/blob/70ceece7f93901e60e7cd67fe88ff97d17306e68/R/tokens.R#L274-L286 but serialization depends on `attr(x[[i - 1]], "types")`, so we need to make recompilation very fast, especially https://github.com/quanteda/quanteda/blob/70ceece7f93901e60e7cd67fe88ff97d17306e68/R/tokens-methods-base.R#L184-L185 I think I...
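For readers following along, a rough sketch of what the serialization step amounts to; `serialize_block()` is a made-up name and this only approximates the linked code:

```r
# map one block of character tokens to integer tokens, reusing and extending
# the types accumulated from previously processed blocks
serialize_block <- function(toks_char, types_prev = character()) {
    types_new <- unique(unlist(toks_char, use.names = FALSE))
    types <- union(types_prev, types_new)        # recompiled types
    toks_int <- lapply(toks_char, match, types)  # remap tokens to integer IDs
    attr(toks_int, "types") <- types
    toks_int
}
```

Because block i needs the types accumulated up to block i - 1, this remapping is inherently serial; only the tokenization within each block can run in parallel, which is why the recompilation has to be cheap.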
```r
require(quanteda)
# parallel 1 (tokenize and serialize in R)
toks1
```
Actually, parallel C++ or R is not faster than serial R for simple remapping.

```r
corp
```
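The kind of comparison meant here might look like the following (the chunking is assumed); remapping is just a vectorized `match()`, so splitting it across workers mostly adds copying overhead:

```r
require(parallel)

types <- sprintf("word%06d", 1:50000)
toks <- sample(types, 5e6, replace = TRUE)

# serial remapping of character tokens to integer type IDs
ids_serial <- match(toks, types)

# parallel remapping over chunks: the per-chunk work is cheap relative to the
# cost of shipping chunks and results between processes
chunks <- split(toks, cut(seq_along(toks), 4, labels = FALSE))
ids_parallel <- unlist(mclapply(chunks, match, types, mc.cores = 4),
                       use.names = FALSE)
```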
@kbenoit please try `tokens_parallel()` in the `dev-tokens_parallel` branch. It seems about three times faster on a machine with 4 cores.

```r
require(quanteda)
require(future)
corp
```
We might need to set `options(mc.cores = 8)` for `future_lapply()` too, but this is already promising! The advantage of parallel lapply becomes greater when ndoc is larger.
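As a usage sketch of the approach being discussed (the block splitting, plan setup, and combination with `c()` are assumptions on my part; `future_lapply()` was in the future package at the time and now lives in future.apply):

```r
require(quanteda)
require(future)
require(future.apply)
plan(multisession, workers = 4)
options(mc.cores = 8)  # relevant when a forking (multicore) plan is used

txts <- as.character(data_corpus_inaugural)

# tokenize blocks of documents on separate workers, then combine;
# c() on tokens objects recompiles the types
blocks <- split(txts, cut(seq_along(txts), 4, labels = FALSE))
toks <- do.call(c, unname(future_lapply(blocks, tokens)))
```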
Interestingly, `parallel::mclapply` outperformed `future_lapply()` when executed by Rscript.

```
Unit: seconds
                                  expr       min        lq      mean    median
                      tokens_test(txt) 100.38170 102.46677 106.11569 103.97826
 tokens_test(txt, FUN = future_lapply)  43.63455  46.37512  50.62325  47.57711
                      tokens_test(txt,...
```
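A sketch of how such a comparison could be run from Rscript; `tokens_test()` is only a stand-in here, since its actual definition is not shown in the thread:

```r
require(quanteda)
require(parallel)
require(future)
require(future.apply)
require(microbenchmark)
plan(multisession)

# hypothetical wrapper: tokenize blocks of texts with a given apply function
tokens_test <- function(txt, FUN = lapply) {
    blocks <- split(txt, cut(seq_along(txt), 4, labels = FALSE))
    do.call(c, unname(FUN(blocks, tokens)))
}

txt <- rep(unname(as.character(data_corpus_inaugural)), 20)
names(txt) <- paste0("text", seq_along(txt))

microbenchmark(
    tokens_test(txt),
    tokens_test(txt, FUN = future_lapply),
    tokens_test(txt, FUN = mclapply),
    times = 10
)
```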
On Windows (RStudio):

```
                 expr      min       lq     mean   median       uq      max neval
 tokens_parallel(txt) 38.19173 39.43982 42.30942 41.41162 41.63359 55.98195    10
          tokens(txt) 87.46992 88.77185 89.83614 89.41530 91.63647 91.99573    10
```
Now we can use `future_lapply()`, but the question is whether it is reliable enough to replace `lapply`. It would be safer to allow users to use `lapply` when `tokens("future" = FALSE)` or...
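One way to leave that choice to users would be a simple switch like the sketch below; the option name and helper are only illustrative, not an actual quanteda API:

```r
# hypothetical: pick the apply function from a user-facing option
get_applier <- function(use_future = getOption("quanteda_use_future", TRUE)) {
    if (use_future && requireNamespace("future.apply", quietly = TRUE)) {
        future.apply::future_lapply
    } else {
        lapply
    }
}

# inside tokens(), something like:
# applier <- get_applier(); applier(blocks, tokenize_block)
```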