
Add a parallel option

Open trinker opened this issue 6 years ago • 5 comments

A parallel option that runs sentiment and sentiment_by on multiple cores

trinker, Jul 26 '17 15:07

Dump everything out to temp .rds files and read them back on the cluster workers. Also add a library arg.
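A rough sketch of how the temp-file idea might look, using only base R and parallel. The names `par_apply_rds`, `attach_libs`, `read_and_apply`, and the `libs` argument are hypothetical and not part of sentimentr. The helper functions are defined at top level on purpose so that their enclosing environment (and with it the full data) is not serialized to the workers.

```r
library(parallel)

# Hypothetical helper: attach the requested packages on a worker
attach_libs <- function(pkgs) {
    for (pkg in pkgs) library(pkg, character.only = TRUE)
    invisible(NULL)
}

# Hypothetical helper: a worker reads its own chunk from disk and processes it
read_and_apply <- function(path, worker_fun) {
    worker_fun(readRDS(path))
}

# Hypothetical wrapper (not sentimentr API): write each chunk to a temporary
# .rds file, then let the workers read the files back and apply `fun`
par_apply_rds <- function(chunks, fun, libs = character(), cores = 2) {

    # One temporary .rds file per chunk
    paths <- vapply(chunks, function(chunk) {
        path <- tempfile(fileext = ".rds")
        saveRDS(chunk, path)
        path
    }, character(1))
    on.exit(unlink(paths), add = TRUE)

    cl <- makeCluster(cores)
    on.exit(stopCluster(cl), add = TRUE)

    # The "library arg": packages to attach on every worker
    clusterCall(cl, attach_libs, libs)

    # Only the file paths travel to the workers, not the data
    parLapply(cl, paths, read_and_apply, worker_fun = fun)
}

# Usage with a base function so the sketch runs without sentimentr
chunks <- list(c("a", "bb"), c("ccc", "dddd"))
res <- par_apply_rds(chunks, nchar, cores = 2)
```

With sentimentr installed on the workers, the same wrapper could presumably be called with `function(x) sentimentr::sentiment_by(sentimentr::get_sentences(x))` as `fun` and `libs = "sentimentr"`.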

trinker, Dec 18 '17 13:12

Initial attempts lead to an error on Windows (~~parallel seems to be using an old version of R and throws an error with regard to Rcpp being the wrong version~~ fixed this by putting a newer version of R on the path, but now there is an error related to sentimentr, which suggests an old version is still being picked up???). Maybe all R installations need to be removed from the path??

if (!require("pacman")) install.packages("pacman")
pacman::p_load(sentimentr, parallel, textshape, dplyr)


chunk_size <- 1e5
dir.create('data')

dat <- combine_data() %>%
    {.[rep(seq_len(nrow(.)), 100),]} %>%
    sample_n(nrow(.)) %>%
    split_index({inds <- chunk_size * 1:round(nrow(.)/chunk_size, 0); inds[inds < nrow(.)]})

tic <- Sys.time()

cl <- makeCluster(mc <- getOption("cl.cores", detectCores() - 2))

clusterEvalQ(cl, {
    library(sentimentr)
    library(lexicon)
})


parLapply(cl, dat, function(x){

    gc()

    senti_dat <- sentimentr::get_sentences(x)
    senti_dat <- sentimentr::sentiment_by(senti_dat)

    # Draw one random ID so that sprintf() yields a single file name
    outfile <- sprintf('data/file_%s.rds', sample(1:100000, 1))
    saveRDS(senti_dat, outfile)

}) %>%
    invisible()

stopCluster(cl)

Sys.time() - tic

Results in:

Error in checkForRemoteErrors(val) : 
  6 nodes produced errors; first error: 'get_sentences' is not an exported object from 'namespace:sentimentr'
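Since the error suggests the workers may be loading a different sentimentr than the master session, a quick diagnostic along these lines could show what each worker actually sees. This is only a sketch; the variable name `worker_info` and the choice of two workers are illustrative.

```r
library(parallel)

cl <- makeCluster(2)

# Ask every worker which R build, library paths, and sentimentr version it sees
worker_info <- clusterEvalQ(cl, list(
    r_version  = R.version.string,
    r_home     = R.home(),
    lib_paths  = .libPaths(),
    sentimentr = if (requireNamespace("sentimentr", quietly = TRUE)) {
        as.character(utils::packageVersion("sentimentr"))
    } else {
        NA_character_  # sentimentr is not installed for this worker
    }
))

stopCluster(cl)

# Compare the first worker against the master session
R.version.string
worker_info[[1]]$r_version
worker_info[[1]]$lib_paths
```

If the versions or library paths differ, `makePSOCKcluster()` documents an `rscript` argument for pointing workers at a specific Rscript binary, which may be worth trying.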

trinker, Feb 10 '18 16:02

http://appliedpredictivemodeling.com/blog/2018/1/17/parallel-processing

Is either of the following a better way to run parallel code?

  • https://github.com/r-lib/callr
  • https://github.com/r-lib/processx

An OS-independent solution is needed. Reinvestigate the available solutions and reach out to the R community for current best practices.

trinker, Sep 24 '18 01:09

Here's where I ask the R community: https://twitter.com/tylerrinker/status/1044364197797265408

  • https://github.com/HenrikBengtsson/future recommended by Julia Silge
  • https://github.com/DavisVaughan/furrr recommended by Garrett Mooney
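For reference, a minimal sketch of what the future/furrr route might look like for this use case. It assumes future, furrr, and sentimentr are installed; the toy `chunks` data is made up for illustration, and this has not been benchmarked against the parLapply attempt.

```r
library(future)
library(furrr)

# Start background R sessions; multisession is intended to work on
# Windows, macOS, and Linux alike
plan(multisession, workers = 2)

# Toy input: two chunks of text (illustrative data, not from this issue)
chunks <- list(
    c("I love this. It works well.", "This is terrible."),
    c("Not bad at all.", "I am unhappy with the result.")
)

# Score each chunk on a worker; sentimentr is reached through :: calls
scores <- future_map(chunks, function(x) {
    sentimentr::sentiment_by(sentimentr::get_sentences(x))
})

# Return to sequential processing, which also shuts the workers down
plan(sequential)

# One average sentiment per input string, in the original order
unlist(lapply(scores, function(s) s$ave_sentiment))
```

Each chunk is scored independently, so `element_id` restarts at 1 within every chunk; a chunk identifier would need to be added before binding the results together.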

trinker, Sep 24 '18 23:09

Some other packages:

  • https://cran.r-project.org/package=snow
  • https://cran.r-project.org/package=pbdMPI

The future package looks easiest to use, but MPI has a long history of support. A tutorial with some relevant further reading: https://towardsdatascience.com/getting-started-with-parallel-programming-in-r-d5f801d43745

bkmgit, Oct 29 '20 17:10