slow memory retrieval (significantly slower than simple pickle)

MInner opened this issue 8 years ago · 37 comments

Hi,

I'm a little confused about why reading from and writing to a (file-based) Memory takes such an enormous amount of time compared to bare pickling/unpickling.

In my case, func() is a tiny memoized function that takes a short string argument and returns a (short) dict with (long) lists of ~complex objects. For some reason, retrieving the function's result from the cache takes significantly more time than just unpickling the file. The resulting file is approximately 70 MB.

I observe the same thing with any other function.
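The setup is roughly the following (a minimal sketch; the real func builds the ~70 MB result described above):

from joblib import Memory

memory = Memory('/tmp/joblib_cache', verbose=0)

@memory.cache
def func(some_str):
    # Returns a (short) dict mapping keys to (long) lists of objects;
    # on disk the cached result is ~70 MB.
    ...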

%prun func(some_str)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   12.436   12.436   52.011   52.011 pickle.py:1014(load)
 41531482    7.665    0.000   11.931    0.000 pickle.py:226(read)
  1922386    5.547    0.000    7.339    0.000 pickle.py:1504(load_build)
 41531483    4.266    0.000    4.266    0.000 {method 'read' of '_io.BufferedReader' objects}
  6490284    3.753    0.000    6.666    0.000 pickle.py:1439(load_long_binput)
  2645763    2.666    0.000    4.764    0.000 pickle.py:1192(load_binunicode)
 30070039    2.403    0.000    2.403    0.000 {built-in method builtins.isinstance}
  4140172    1.870    0.000    3.225    0.000 pickle.py:1415(load_binget)
  1922386    1.369    0.000    2.049    0.000 pickle.py:1316(load_newobj)
  9196954    1.359    0.000    1.359    0.000 {built-in method _struct.unpack}
  1922386    1.114    0.000    8.724    0.000 numpy_pickle.py:319(load_build)
 10857316    0.962    0.000    0.962    0.000 {method 'pop' of 'list' objects}
 14536246    0.873    0.000    0.873    0.000 {method 'append' of 'list' objects}
  1922386    0.873    0.000    1.218    0.000 pickle.py:1472(load_setitem)
  1922393    0.816    0.000    0.816    0.000 {built-in method builtins.getattr}
   676815    0.765    0.000    1.384    0.000 pickle.py:1458(load_appends)
  1922387    0.730    0.000    0.832    0.000 pickle.py:1257(load_empty_dictionary)
        1    0.715    0.715   53.099   53.099 <string>:1(<module>)
  1245385    0.559    0.000    0.848    0.000 pickle.py:1451(load_append)
...

%prun len(pickle.load(open("..file..", 'rb')))
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    4.587    4.587    4.587    4.587 {built-in method _pickle.load}
        1    0.553    0.553    5.140    5.140 <string>:1(<module>)
        1    0.000    0.000    5.140    5.140 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method io.open}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

MInner avatar Dec 21 '16 20:12 MInner

From the profiles I would guess this is because joblib is using the pure-Python pickle implementation, mostly to be able to hook into the pickle mechanism and specialise it for numpy arrays. You are probably using Python 3, for which the standard pickle module uses the fast C implementation. Have a look at #421 for example, where this was discussed in more detail.
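The gap between the two implementations is easy to see directly (a minimal sketch; pickle._Unpickler is the pure-Python unpickler, the same code path as the pickle.py frames visible in your profile):

import io
import pickle
import time

# Many small sub-objects: the worst case for the pure-Python pickler.
data = [{"key": list(range(10))} for _ in range(200000)]
payload = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

start = time.time()
pickle.loads(payload)  # C implementation (_pickle)
c_time = time.time() - start

start = time.time()
pickle._Unpickler(io.BytesIO(payload)).load()  # pure-Python implementation
py_time = time.time() - start

print("C: %.2fs, pure-Python: %.2fs" % (c_time, py_time))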

lesteve avatar Jan 03 '17 13:01 lesteve

Indeed. joblib is not very useful when dealing with Python objects made of a large number of small sub-objects. It's more useful for Python objects with a few large sub-objects (e.g. large numpy arrays).

I am afraid that there is little we can do about it.

ogrisel avatar Jan 04 '17 09:01 ogrisel

I am afraid that there is little we can do about it.

Would it be possible to write in the stored file whether or not there are numpy objects in it, and if there are none use the standard unpickler at load time?

GaelVaroquaux avatar Jan 04 '17 09:01 GaelVaroquaux

Walking the graph structure of such complex objects is actually quite costly in Python (it's much faster in C, which is probably why the cPickle implementation is much faster in the first place).

ogrisel avatar Jan 05 '17 21:01 ogrisel

Walking the graph structure of such complex objects is actually quite costly.

But once the storage is finished, it's done, so after storing, we should know.

GaelVaroquaux avatar Jan 05 '17 23:01 GaelVaroquaux

In the Memory case, I guess whether output.pkl contains a numpy array or not could be written in a metadata file in the same folder (e.g. in metadata.json). That way you know at load time whether you can use pickle.load directly rather than joblib.load.
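Something along these lines, as a hypothetical sketch (the contains_numpy field is made up; it does not exist today):

import json
import os
import pickle

import joblib


def load_cached_output(output_dir):
    # Hypothetical: metadata.json records whether output.pkl contains
    # numpy arrays; default to the safe (joblib) route when unknown.
    with open(os.path.join(output_dir, 'metadata.json')) as f:
        metadata = json.load(f)
    filename = os.path.join(output_dir, 'output.pkl')
    if metadata.get('contains_numpy', True):
        return joblib.load(filename)  # numpy-aware, slower route
    with open(filename, 'rb') as f:
        return pickle.load(f)  # fast C unpickler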

lesteve avatar Jan 06 '17 07:01 lesteve

But once the storage is finished, it's done, so after storing, we should know.

This only solves half of the problem: storing is still slow, only loading becomes fast. Also, storing in the file itself whether or not it contains a numpy array introduces another format change, with the backward-compatibility work that entails.

aabadie avatar Jan 09 '17 08:01 aabadie

In the Memory case, I guess whether output.pkl contains a numpy array or not could be written in a metadata file in the same folder (e.g. in metadata.json). That way you know at load time whether you can use pickle.load directly rather than joblib.load.

Sounds like a potential solution. So pickle.dump should return something indicating whether there's a numpy array or not? And the memory API is responsible for using it or not.

aabadie avatar Jan 09 '17 08:01 aabadie

s/pickle.load/pickle.dump/ in previous comment

aabadie avatar Jan 09 '17 08:01 aabadie

Actually thinking about it, maybe the cleanest thing to do is to add a use_joblib_pickling (for lack of a better name) argument to Memory, which should be True by default.

lesteve avatar Jan 09 '17 11:01 lesteve

Actually thinking about it, maybe the cleanest thing to do is to add a use_joblib_pickling argument to Memory, which should be True by default.

That would work, but the solution of storing whether or not it is a standard pickle would be automatic for the user, no?

I am not sure anymore: if there are no numpy arrays, is the stored file a standard pickle or isn't it?

GaelVaroquaux avatar Jan 09 '17 14:01 GaelVaroquaux

if there are no numpy arrays, is the stored file a standard pickle or isn't it?

it is

aabadie avatar Jan 09 '17 14:01 aabadie

it is

Awesome, so if we know that there are no numpy arrays in it, we can use the fast loading route, right?

GaelVaroquaux avatar Jan 09 '17 14:01 GaelVaroquaux

so if we know that there are no numpy arrays in it, we can use the fast loading route, right?

Yes, but it can be costly to determine whether an arbitrary object contains a numpy array (if we use the Python implementation of pickle).

aabadie avatar Jan 09 '17 14:01 aabadie

Yes, but it can be costly to determine whether an arbitrary object contains a numpy array (if we use the Python implementation of pickle).

At write time we can insert a code path in our pickler which detects that.

GaelVaroquaux avatar Jan 09 '17 15:01 GaelVaroquaux

a code path

What do you mean?

aabadie avatar Jan 09 '17 15:01 aabadie

What do you mean?

In our NumpyPickler, we can change a flag (on the pickler for instance) when we hit an array.
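Something like this, as an untested sketch on top of the current NumpyPickler:

import numpy as np

from joblib.numpy_pickle import NumpyPickler


class FlaggingNumpyPickler(NumpyPickler):
    """Untested sketch: remember whether any ndarray was pickled."""

    def __init__(self, *args, **kwargs):
        super(FlaggingNumpyPickler, self).__init__(*args, **kwargs)
        self.saw_array = False

    def save(self, obj, *args, **kwargs):
        # Every object goes through save(), so this sees arrays
        # wherever they are nested in the object graph.
        if isinstance(obj, np.ndarray):
            self.saw_array = True
        return super(FlaggingNumpyPickler, self).save(obj, *args, **kwargs)

After the dump, the pickler's saw_array flag could then be written out wherever we decide to store it.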

GaelVaroquaux avatar Jan 09 '17 15:01 GaelVaroquaux

In our NumpyPickler, we can change a flag (on the pickler for instance) when we hit an array.

So if I understand correctly, at the end of the dump this flag will be written at the beginning of the pickle file? That will change the pickle format once again, and we'll have to deal with 3 pickle formats:

  • valid pickles with no numpy arrays
  • old invalid pickles with numpy arrays, introduced in 0.10
  • new invalid pickles with numpy arrays

That will be a pleasure to maintain ;)

aabadie avatar Jan 09 '17 15:01 aabadie

So if I understand correctly, at the end of the dump this flag will be written at the beginning of the pickle file?

I think that we could save it in the metadata of the memory object.

GaelVaroquaux avatar Jan 09 '17 15:01 GaelVaroquaux

I think that we could save it in the metadata of the memory object.

I mean in the metadata of the memory entry, i.e. what is currently stored in the .json file.

It's a bit of a hack, granted.

GaelVaroquaux avatar Jan 09 '17 15:01 GaelVaroquaux

I mean in the metadata of the memory entry, i.e. what is currently stored in the .json file.

Ok

At load time, what about a try/except strategy?

try:
    result = pickle.load(f)
except Exception:
    # Can fail if there's a numpy array in the pickle.
    result = joblib.load(filename)

If there's no numpy array, the load will use the C implementation of pickle; otherwise it will fail and fall back to the joblib loading mechanism.

aabadie avatar Jan 09 '17 15:01 aabadie

At load time, what about a try/except strategy?

I thought about that. The problem is that the array may be at the very end of the pickle, so you waste a lot of time loading almost all of the object before ultimately failing and trying the joblib.load strategy.

I don't think we can change the pickle format (adding the info somewhere in the file about whether an array is present in the pickle) while still:

  • keeping streamability, which was one of the features we strove for during the single-file pickle PR
  • avoiding walking the entire object to figure out whether an array is present or not

I think that we could save it in the metadata of the memory object.

We could do that (I mentioned this option in https://github.com/joblib/joblib/issues/467#issuecomment-270848836).

I find adding a use_joblib_pickling argument a bit cleaner (as mentioned in https://github.com/joblib/joblib/issues/467#issuecomment-271266252): for example, there is one less file to read at joblib.load time (metadata.json is not read when we do joblib.load currently). I guess the inconvenience of use_joblib_pickling is that the user has to make an informed decision.

lesteve avatar Jan 09 '17 17:01 lesteve

Also, in the context of scikit-learn, the assumption is that estimators should be saved/loaded with joblib. However, for instance a TfidfVectorizer object (which contains no numpy arrays) is ~50x slower to load with joblib than with Python 3 pickle.
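Roughly measured like this (a sketch; any sizeable text corpus gives a similar picture):

import pickle
import time

import joblib
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

texts = fetch_20newsgroups(subset='train').data  # downloaded on first use
vectorizer = TfidfVectorizer().fit(texts)

joblib.dump(vectorizer, '/tmp/vec.joblib')
with open('/tmp/vec.pkl', 'wb') as f:
    pickle.dump(vectorizer, f, protocol=pickle.HIGHEST_PROTOCOL)

start = time.time()
joblib.load('/tmp/vec.joblib')
joblib_time = time.time() - start

start = time.time()
with open('/tmp/vec.pkl', 'rb') as f:
    pickle.load(f)
pickle_time = time.time() - start

print('joblib: %.2fs, pickle: %.2fs' % (joblib_time, pickle_time))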

rth avatar Apr 06 '17 13:04 rth

just to add another use case for this: I'm trying out using joblib to cache sets of compiled Theano functions so I don't have to constantly re-build them when I'm developing and restarting things frequently. I previously rolled my own hash/pickle caching, but using joblib is much more elegant. The function sets end up being 200-250 MB uncompressed, and loading/storing them is about 50-75% slower than when using pickle directly (this is on Python 3.6, where pickle == cPickle).

ssfrr avatar Jul 10 '17 22:07 ssfrr

whoops, correction: it seems that when I turn compression off the speeds are actually pretty comparable, but the size on disk seems about twice as big as when I use pickle directly (using HIGHEST_PROTOCOL). Sorry for the misleading info.

ssfrr avatar Jul 10 '17 22:07 ssfrr

Did you try using lzma? I seem to remember that it's a pretty fast compression algorithm.

GaelVaroquaux avatar Jul 10 '17 22:07 GaelVaroquaux

I seem to remember that it's a pretty fast compression algorithm.

It depends on the data, but generally it compresses better at the cost of slower speed.

aabadie avatar Jul 11 '17 06:07 aabadie

It depends on the data, but generally it compresses better at the cost of slower speed.

I've already made this mistake: I keep confusing it with another one (lz4, I believe). Maybe it would be useful to add hints in the joblib docs about which is fast and which compresses better.
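For reference, joblib.dump already accepts a (compressor, level) tuple, so the docs could contrast them side by side:

import joblib

obj = [list(range(100)) for _ in range(100000)]

# zlib at a low level: fast, moderate compression.
joblib.dump(obj, '/tmp/obj.pkl.z', compress=('zlib', 1))

# xz (lzma) at a higher level: slower, but usually smaller files.
joblib.dump(obj, '/tmp/obj.pkl.xz', compress=('xz', 3))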

GaelVaroquaux avatar Jul 11 '17 11:07 GaelVaroquaux

So, is there any hope of getting memory.cache to use normal pickle? It seems to at least double my load time.

ghost avatar Jan 09 '18 15:01 ghost

Here is a workaround. It may be a bit brittle, meaning it might break with the next joblib release, but maybe it helps in your case.

import os
import pickle
import time

import joblib
from joblib import memory
from joblib.logger import format_time  # needed by the verbose branch below


def my_load_output(output_dir, func_name, timestamp=None, metadata=None,
                   mmap_mode=None, verbose=0):
    """Load output of a computation."""
    if verbose > 1:
        signature = ""
        try:
            if metadata is not None:
                args = ", ".join(['%s=%s' % (name, value)
                                  for name, value
                                  in metadata['input_args'].items()])
                signature = "%s(%s)" % (os.path.basename(func_name), args)
            else:
                signature = os.path.basename(func_name)
        except KeyError:
            pass

        if timestamp is not None:
            t = "% 16s" % format_time(time.time() - timestamp)
        else:
            t = ""

        if verbose < 10:
            print('[Memory]%s: Loading %s...' % (t, str(signature)))
        else:
            print('[Memory]%s: Loading %s from %s' % (
                    t, str(signature), output_dir))

    filename = os.path.join(output_dir, 'output.pkl')
    if not os.path.isfile(filename):
        raise KeyError(
            "Non-existing cache value (may have been cleared).\n"
            "File %s does not exist" % filename)
    with open(filename, 'rb') as f:
        result = pickle.load(f)

    return result

original_load_output = memory._load_output
memory._load_output = my_load_output

mem = joblib.Memory('/tmp/test')

def identity(x):
    return x

cached_identity = mem.cache(identity)
cached_identity(3)

Maybe we could make it easier to override the dump and load functions in derived classes of Memory. I'm not entirely sure whether there is hope of doing it in a cleaner way inside joblib. See https://github.com/joblib/joblib/issues/467#issuecomment-271345741 for what I think is a decent summary. You may be able to find more context if you search through the issues.

lesteve avatar Jan 09 '18 16:01 lesteve