corncob parallel processing

It seems to me that running differentialTest is an embarrassingly parallel problem. For my current project, the run time is too short to justify starting an analysis in the background and coming back to the results later, but too long to make interactive work really pleasant.

As far as I can tell, a single for loop runs all the models. What do you think about parallelizing that? For example, foreach could be a quick replacement.

Mar 17 '21 20:03 Midnighter

Another great idea, thanks @Midnighter.

Here's a question for you as someone interested in this feature. Would you prefer:

1. New parameters in differentialTest() such as parallel and ncores
2. A new function such as differentialTest_parallel()

Basically, I'm wondering which of these designs you think would be more intuitive. If anyone else happens to see this before I implement, feel free to offer your opinion as well.

Mar 17 '21 21:03 bryandmartin

I definitely prefer another parameter on the existing function. I realize this will require some internal restructuring (maybe an opportunity to refactor some of the code).

If you do use foreach to implement this, another option is to let the user create the backend and decide what to run based on that, similar to the following pseudocode.

library(foreach)
library(doParallel)

# user-defined: register a parallel backend with 3 workers
registerDoParallel(3)

# within differentialTest: go parallel only if a backend with more than one worker is registered
# (&& rather than &, since if() needs a single scalar condition)
if (getDoParRegistered() && getDoParWorkers() > 1) {
  foreach(...) %dopar% {
  }
} else {
  foreach(...) %do% {
  }
}
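To make the backend check concrete, here is a minimal runnable sketch. The helper fit_all() and the toy count matrix are hypothetical, and the column mean is only a stand-in for a per-taxon model fit; this is not how corncob is implemented internally.

```r
library(foreach)
library(doParallel)

# Hypothetical helper: one "model" per column, parallel only if a backend is registered.
fit_all <- function(counts) {
  # Choose the foreach operator once, based on the user-registered backend.
  `%op%` <- if (getDoParRegistered() && getDoParWorkers() > 1) `%dopar%` else `%do%`
  foreach(j = seq_len(ncol(counts)), .combine = c) %op% {
    mean(counts[, j])  # placeholder for fitting one model
  }
}

registerDoParallel(2)            # user-defined backend
counts <- matrix(1:12, nrow = 3) # toy data, 4 columns
res <- fit_all(counts)
stopImplicitCluster()            # release the workers started by registerDoParallel()
```

Choosing the operator once keeps the loop body in a single place instead of duplicating it across the two branches.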

Mar 17 '21 23:03 Midnighter

Good stuff, thanks a ton! I'll implement it that way.

Mar 18 '21 00:03 bryandmartin

If you were inclined to rework your code to process data with dplyr and purrr rather than for loops, there is also furrr, which offers parallel implementations of purrr functions. There are many examples out there, especially for model fitting, but I guess it would mean an almost complete rewrite of corncob.
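For a rough idea of what the furrr style looks like, here is a small sketch that fits one linear model per group of the built-in mtcars data. The data set and the model are stand-ins chosen for illustration only and have nothing to do with corncob's models.

```r
library(future)
library(furrr)

# Run subsequent future_* calls on two background R sessions.
plan(multisession, workers = 2)

# One data frame per cylinder count.
groups <- split(mtcars, mtcars$cyl)

# Parallel counterpart of purrr::map_dbl(): one lm() fit per group.
slopes <- future_map_dbl(groups, function(d) {
  coef(lm(mpg ~ wt, data = d))[["wt"]]
})

# Return to sequential evaluation and shut the workers down.
plan(sequential)
```

The call site reads the same as the purrr version, and the choice of backend is left to plan().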

Mar 18 '21 08:03 Midnighter

One gotcha that we encountered: if you have a parallel BLAS backend (like MKL or OpenBLAS), a lot of the linear algebra will already run in parallel. If you spawn too many processes on top of that, the threads oversubscribe the cores and everything slows down, so you need to set "OMP_NUM_THREADS" or the equivalent for it to be efficient. Also, mclapply might be enough here and requires no additional dependencies.
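A minimal sketch of that combination, using only the parallel package that ships with R. The toy matrix and the column mean are placeholders for the real per-taxon fits. Whether setting OMP_NUM_THREADS from inside a running session takes effect depends on the BLAS build, so setting it in the shell before starting R is the safer assumption.

```r
library(parallel)

# Ask OpenMP-based BLAS builds for one thread per process to avoid oversubscription.
# Assumption: the BLAS in use honours this variable; MKL may need MKL_NUM_THREADS instead.
Sys.setenv(OMP_NUM_THREADS = "1")

# mclapply() forks, which is unavailable on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

counts <- matrix(1:12, nrow = 3)  # toy data, 4 columns

# One task per column; mean() stands in for fitting one model.
fits <- mclapply(seq_len(ncol(counts)), function(j) mean(counts[, j]),
                 mc.cores = n_cores)
fits <- unlist(fits)
```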

Mar 18 '21 21:03 cdiener
