pyGAM

add joblib

Open jeweinberg opened this issue 6 years ago • 6 comments

Is there a plan to add joblib into the project? It would be nice to be able to set n_jobs for each of the algorithms similar to sklearn.

jeweinberg Sep 14 '17 18:09

yes! i would love to add that for parallelizing the gridsearch and bootstrapping tasks.

i've only messed around with it a little bit with Multiprocessing (https://docs.python.org/2/library/multiprocessing.html) and Pathos (https://github.com/uqfoundation/pathos) but i've been running into problems getting the arguments to pickle/dill correctly, and other basic things.

i've left it on the back-burner for now, since i think there are bigger fish to fry, but if you are knowledgeable about it, i would certainly welcome a PR or example :)

dswah Sep 14 '17 19:09

The concept is fairly simple. You wrap the function call in delayed and pass the resulting generator to Parallel.

    >>> from math import sqrt
    >>> from joblib import Parallel, delayed
    >>> Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10))
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
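The same pattern should extend to the bootstrapping task mentioned earlier in the thread. Below is a minimal sketch, not pyGAM code: `bootstrap_mean` and `parallel_bootstrap` are hypothetical names made up for illustration, and a plain mean stands in for whatever would actually be refit on each resample.

```python
import random
from statistics import mean

from joblib import Parallel, delayed


def bootstrap_mean(data, seed):
    """Resample `data` with replacement and return the mean of the resample."""
    rng = random.Random(seed)  # one independent, reproducible RNG per task
    sample = [rng.choice(data) for _ in data]
    return mean(sample)


def parallel_bootstrap(data, n_boot=100, n_jobs=1):
    """Run `n_boot` bootstrap replicates across `n_jobs` workers.

    Hypothetical helper: the `n_jobs` argument is simply forwarded to
    joblib, mirroring how sklearn estimators expose it.
    """
    return Parallel(n_jobs=n_jobs)(
        delayed(bootstrap_mean)(data, seed) for seed in range(n_boot)
    )


data = [1.0, 2.0, 3.0, 4.0, 5.0]
means = parallel_bootstrap(data, n_boot=50, n_jobs=2)
print(len(means), min(means), max(means))
```

Seeding each replicate explicitly keeps the results identical regardless of how many workers are used, which makes the parallel path easy to test against the serial one.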

jeweinberg Sep 14 '17 19:09

for some reason i don't think i've tried joblib...

in practice i've run into implementation issues (some stemming from OSX) with the various start methods (fork, forkserver, spawn, etc), executing numpy code in the forked process, and pickling the objects that get sent to the child processes...

i'll give this a shot.
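The pickling problem mentioned above can be reproduced without any multiprocessing at all, which makes it easier to debug. This is only a sketch: `can_pickle` is a hypothetical helper, and it checks the standard `pickle` module only. As far as I know, joblib's default loky backend serializes callables with cloudpickle, which accepts more objects than plain `pickle`, so a failure here does not necessarily mean joblib would fail.

```python
import math
import pickle


def can_pickle(obj):
    """Return True if the standard pickle module can serialize `obj`."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A function importable by name pickles fine (it is stored as a reference).
print(can_pickle(math.sqrt))
# A lambda has no importable name, so plain pickle rejects it.
print(can_pickle(lambda x: x ** 2))
```

Running a check like this on the arguments before handing them to worker processes narrows down which object is the culprit.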

dswah Sep 14 '17 20:09

@dswah I highly recommend focusing on joblib for parallelism! It is supported by the whole pydata stack, and Dask is able to provide a backend for joblib. So using joblib gives pyGAM the ability to be distributed across thousands of worker nodes, making it a potential big data tool. Read this for more info. Joblib is also really easy for programmers to use, as @jeweinberg pointed out.

h4gen Nov 01 '18 17:11

@jeweinberg @h4gen thanks for the tips.

i am adding joblib for concurrent execution and out-of-core learning.

do you all know if it is necessary to add the partial_fit() method for distributed fitting with dask?

dswah Nov 02 '18 16:11

@dswah It should not be necessary, as far as I understand it. Using Parallel and delayed from joblib should be sufficient when dask is the backend. Dask distributes the data as well as the jobs by itself. dask devs may correct me if I am wrong.
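To illustrate what "used as backend" means here: the Parallel call stays the same and the backend is chosen by a surrounding context manager. The sketch below uses the `threading` backend that ships with joblib so it runs anywhere; `sqrt_all` is a hypothetical helper, and the dask variant in the trailing comment is an assumption about how it would be wired up with `dask.distributed` installed, not something run here.

```python
from math import sqrt

from joblib import Parallel, delayed, parallel_backend


def sqrt_all(values, backend="threading", n_jobs=2):
    """Compute square roots of `values` using the named joblib backend."""
    # The Parallel call never names the backend itself; the context
    # manager selects it for everything executed inside the block.
    with parallel_backend(backend, n_jobs=n_jobs):
        return Parallel()(delayed(sqrt)(v) for v in values)


# "threading" is built into joblib, so no extra dependencies are needed.
roots = sqrt_all([0, 1, 4, 9])
print(roots)

# Assumption: with dask.distributed installed, the same function could be
# pointed at a cluster roughly like this (not executed here):
#
#     from dask.distributed import Client
#     client = Client()  # start or connect to a scheduler
#     roots = sqrt_all([0, 1, 4, 9], backend="dask")
```

If that holds, the library code only needs to use Parallel and delayed, and the choice of local versus distributed execution is left to the caller.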

h4gen Nov 02 '18 21:11