Joblib.dump should use cloudpickle
From https://github.com/scikit-learn/scikit-learn/pull/12905#issuecomment-455912267 it would be useful to consider a modern pickler for joblib.dump/load that does not silently pickle unloadable objects.
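To make the "silently pickles unloadable objects" concern concrete, here is a minimal sketch using only the standard-library pickle. It assumes the failure mode in question is pickling by reference: pickle stores a function as a module name plus attribute name, so the dump succeeds even when the loading side cannot import that module. The module name `_demo_unloadable_helpers` and the function `scale` are hypothetical, created on the fly purely for illustration.

```python
import pickle
import sys
import types

# Hypothetical helper module, built at runtime so the example is self-contained.
MODULE_NAME = "_demo_unloadable_helpers"
helpers = types.ModuleType(MODULE_NAME)
exec("def scale(x):\n    return 2 * x\n", helpers.__dict__)
sys.modules[MODULE_NAME] = helpers

# Dumping succeeds without any warning: only a reference
# (module name + function name) is stored, not the function's code.
payload = pickle.dumps(helpers.scale)

# Simulate a loading environment where the module is not importable.
del sys.modules[MODULE_NAME]
try:
    pickle.loads(payload)
    load_error = None
except ImportError as exc:
    # The problem only surfaces at load time.
    load_error = exc

print(type(load_error).__name__)
```

cloudpickle, by contrast, serializes functions defined in `__main__` or created dynamically by value, which is the behavior this issue asks for.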
The problem is that joblib.dump supports features that neither pickle nor cloudpickle supports: no-copy dump / load of large numpy arrays (streaming pickling) and memory mapping of large numpy arrays.
In future numpy / Python versions, with the support of PEP 574, it will be possible to have all those no-copy / memmap features with the native C pickler implementation (instead of ugly internal hacks in joblib).
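For reference, the mechanism PEP 574 adds (pickle protocol 5 with out-of-band buffers, available in the standard library from Python 3.8) can be sketched as follows. This is only an illustration of the stdlib API: a plain `bytearray` stands in for a large numpy array, and nothing here reflects how joblib would actually integrate it.

```python
import pickle
from pickle import PickleBuffer

# Stand-in for the memory of a large array (assumption: 1 MB of bytes).
raw = bytearray(b"x" * 1_000_000)

# Out-of-band: the pickler hands each buffer to the callback instead of
# copying its bytes into the pickle stream.
buffers = []
payload = pickle.dumps(PickleBuffer(raw), protocol=5,
                       buffer_callback=buffers.append)

# In-band (no callback): the bytes are copied into the stream.
in_band = pickle.dumps(PickleBuffer(raw), protocol=5)

# The stream itself stays tiny; the data travels separately.
print(len(payload), len(in_band))

# Loading with the same buffers gives back an object that still
# points at the original memory, so no copy was made.
restored = pickle.loads(payload, buffers=buffers)
print(memoryview(restored).obj is raw)
```

The caller decides what to do with `buffers` (write them to disk, memory map them, send them over a socket), which is what would let a standard pickler cover the no-copy cases.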
Related: @pierreglaser is also working on a C implementation of the main cloudpickle features to be contributed upstream to the standard library, hopefully for Python 3.8.
It might be possible to extend the internal NumpyPickler or joblib.dump / load to derive from CloudPickler though (in a shorter term). I am not sure whether this would be complex or not.
@ogrisel making the NumpyPickler inherit from the CloudPickler class is also what I suggested in the upstream sklearn issue. There do not seem to be strong conflicts at first glance.
Sounds great. I am not familiar with the joblib codebase, but I'd love to help make this fix happen.
Can you please point me to the right place?
Searching the code for NumpyPickler is probably a good start! But so is writing a test!
There is another issue: cloudpickle is only meant to pickle transient objects between Python workers for parallel processing on a multicore machine or a cluster. It is not meant to work if the pickling code and the unpickling code use different versions of Python, which can potentially be the case when using a shared joblib cache folder.
W.r.t. the numpy pickler: once PEP 574 is widely adopted, we can rewrite joblib to use it, which means that the resulting pickle file would respect the pickle protocol (and could be loaded by the regular pickle.load instead of having to use joblib.load).
Hi @ogrisel,
I hope all is well with you 😄 .
I am following up on this rather old thread with a perhaps stupid question. I have read the docs and a few related issues, but haven't been able to work out the answer with certainty.
Is the latest version of joblib now using cloudpickle as a backend for joblib.dump? If so, how do I activate it?
Cheers,
Alex