slow memory retrieval (significantly slower than simple pickle)

MInner opened this issue 8 years ago · 37 comments

Hi,

I'm a little confused about why reading from and writing to a (file-based) Memory takes such an enormous amount of time compared to bare pickling/unpickling.

In my case, func() is a tiny memoized function that takes a short string argument and returns a (short) dict with (long) lists of ~complex objects. For some reason, retrieving the function's result from the cache takes significantly more time than just unpickling the file. The resulting file is approximately 70 MB.

I observe the same thing with any other function.
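The setup is roughly the following (a minimal sketch; the real func builds the ~70 MB result described above):

from joblib import Memory

memory = Memory('/tmp/joblib_cache', verbose=0)

@memory.cache
def func(some_str):
    # Returns a (short) dict mapping keys to (long) lists of objects;
    # on disk the cached result is ~70 MB.
    ...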

%prun func(some_str)

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   12.436   12.436   52.011   52.011 pickle.py:1014(load)
 41531482    7.665    0.000   11.931    0.000 pickle.py:226(read)
  1922386    5.547    0.000    7.339    0.000 pickle.py:1504(load_build)
 41531483    4.266    0.000    4.266    0.000 {method 'read' of '_io.BufferedReader' objects}
  6490284    3.753    0.000    6.666    0.000 pickle.py:1439(load_long_binput)
  2645763    2.666    0.000    4.764    0.000 pickle.py:1192(load_binunicode)
 30070039    2.403    0.000    2.403    0.000 {built-in method builtins.isinstance}
  4140172    1.870    0.000    3.225    0.000 pickle.py:1415(load_binget)
  1922386    1.369    0.000    2.049    0.000 pickle.py:1316(load_newobj)
  9196954    1.359    0.000    1.359    0.000 {built-in method _struct.unpack}
  1922386    1.114    0.000    8.724    0.000 numpy_pickle.py:319(load_build)
 10857316    0.962    0.000    0.962    0.000 {method 'pop' of 'list' objects}
 14536246    0.873    0.000    0.873    0.000 {method 'append' of 'list' objects}
  1922386    0.873    0.000    1.218    0.000 pickle.py:1472(load_setitem)
  1922393    0.816    0.000    0.816    0.000 {built-in method builtins.getattr}
   676815    0.765    0.000    1.384    0.000 pickle.py:1458(load_appends)
  1922387    0.730    0.000    0.832    0.000 pickle.py:1257(load_empty_dictionary)
        1    0.715    0.715   53.099   53.099 <string>:1(<module>)
  1245385    0.559    0.000    0.848    0.000 pickle.py:1451(load_append)
...

%prun len(pickle.load(open("..file..", 'rb')))
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    4.587    4.587    4.587    4.587 {built-in method _pickle.load}
        1    0.553    0.553    5.140    5.140 <string>:1(<module>)
        1    0.000    0.000    5.140    5.140 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method io.open}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

MInner avatar Dec 21 '16 20:12 MInner

From the profiles I would guess this is because joblib is using the pure-Python pickle implementation, mostly to be able to hook into the pickle mechanism and specialise it for numpy arrays. You are probably using Python 3, for which the standard pickle module uses the fast C implementation. Have a look at #421 for example, where this was discussed in more detail.
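The gap between the two implementations is easy to see directly (a minimal sketch; pickle._Unpickler is the pure-Python unpickler, the same code path as the pickle.py frames visible in your profile):

import io
import pickle
import time

# Many small sub-objects: the worst case for the pure-Python pickler.
data = [{"key": list(range(10))} for _ in range(200000)]
payload = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

start = time.time()
pickle.loads(payload)  # C implementation (_pickle)
c_time = time.time() - start

start = time.time()
pickle._Unpickler(io.BytesIO(payload)).load()  # pure-Python implementation
py_time = time.time() - start

print("C: %.2fs, pure-Python: %.2fs" % (c_time, py_time))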

lesteve avatar Jan 03 '17 13:01 lesteve

Indeed. joblib is not very useful when dealing with Python objects made of a large number of small sub-objects. It's more useful for Python objects with a few large sub-objects (e.g. large numpy arrays).

I am afraid that there is little we can do about it.

ogrisel avatar Jan 04 '17 09:01 ogrisel

I am afraid that there is little we can do about it.

Would it be possible to write in the stored file whether or not there are numpy objects in it, and if there are none use the standard unpickler at load time?

GaelVaroquaux avatar Jan 04 '17 09:01 GaelVaroquaux

Walking the graph structure of such complex objects is actually quite costly in Python (it's much faster in C, which is probably why the cPickle implementation is much faster in the first place).

ogrisel avatar Jan 05 '17 21:01 ogrisel

Walking the graph structure of such complex objects is actually quite costly.

But once the storage is finished, it's done, so after storing, we should know.

GaelVaroquaux avatar Jan 05 '17 23:01 GaelVaroquaux

In the Memory case, I guess whether output.pkl contains a numpy array or not could be written in a metadata file in the same folder (e.g. in metadata.json). That way you know at load time whether you can use pickle.load directly rather than joblib.load.
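Something along these lines, as a hypothetical sketch (the contains_numpy field is made up; it does not exist today):

import json
import os
import pickle

import joblib


def load_cached_output(output_dir):
    # Hypothetical: metadata.json records whether output.pkl contains
    # numpy arrays; default to the safe (joblib) route when unknown.
    with open(os.path.join(output_dir, 'metadata.json')) as f:
        metadata = json.load(f)
    filename = os.path.join(output_dir, 'output.pkl')
    if metadata.get('contains_numpy', True):
        return joblib.load(filename)  # numpy-aware, slower route
    with open(filename, 'rb') as f:
        return pickle.load(f)  # fast C unpickler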

lesteve avatar Jan 06 '17 07:01 lesteve

But once the storage is finished, it's done, so after storing, we should know.

This only solves half of the problem: storing is still slow, only loading becomes fast. Also, storing in the file itself whether or not it contains a numpy array introduces another format change, with the backward-compatibility work that entails.

aabadie avatar Jan 09 '17 08:01 aabadie

In the Memory case, I guess whether output.pkl contains a numpy array or not could be written in a metadata file in the same folder (e.g. in metadata.json). That way you know at load time whether you can use pickle.load directly rather than joblib.load.

Sounds like a potential solution. So pickle.dump should return something indicating whether there's a numpy array or not? And the memory API is responsible for using it or not.

aabadie avatar Jan 09 '17 08:01 aabadie

s/pickle.load/pickle.dump/ in previous comment

aabadie avatar Jan 09 '17 08:01 aabadie

Actually thinking about it, maybe the cleanest thing to do is to add a use_joblib_pickling (for lack of a better name) argument to Memory, which should be True by default.

lesteve avatar Jan 09 '17 11:01 lesteve

Actually thinking about it, maybe the cleanest thing to do is to add a use_joblib_pickling argument to Memory, which should be True by default.

That would work, but the solution of storing whether or not it is a standard pickle would be automatic for the user, no?

I am not sure anymore: if there are no numpy arrays, is the stored file a standard pickle or isn't it?

GaelVaroquaux avatar Jan 09 '17 14:01 GaelVaroquaux

if there are no numpy arrays, is the stored file a standard pickle or isn't it?

it is

aabadie avatar Jan 09 '17 14:01 aabadie

it is

Awesome, so if we know that there are no numpy arrays in it, we can use the fast loading route, right?

GaelVaroquaux avatar Jan 09 '17 14:01 GaelVaroquaux

so if we know that there are no numpy arrays in it, we can use the fast loading route, right?

Yes, but it can be costly to determine whether an arbitrary object contains a numpy array (if we use the Python implementation of pickle).

aabadie avatar Jan 09 '17 14:01 aabadie

Yes, but it can be costly to determine whether an arbitrary object contains a numpy array (if we use the Python implementation of pickle).

At write time we can insert a code path in our pickler which detects that.

GaelVaroquaux avatar Jan 09 '17 15:01 GaelVaroquaux

a code path

What do you mean?

aabadie avatar Jan 09 '17 15:01 aabadie

What do you mean?

In our NumpyPickler, we can change a flag (on the pickler for instance) when we hit an array.
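Something like this, as an untested sketch on top of the current NumpyPickler:

import numpy as np

from joblib.numpy_pickle import NumpyPickler


class FlaggingNumpyPickler(NumpyPickler):
    """Untested sketch: remember whether any ndarray was pickled."""

    def __init__(self, *args, **kwargs):
        super(FlaggingNumpyPickler, self).__init__(*args, **kwargs)
        self.saw_array = False

    def save(self, obj, *args, **kwargs):
        # Every object goes through save(), so this sees arrays
        # wherever they are nested in the object graph.
        if isinstance(obj, np.ndarray):
            self.saw_array = True
        return super(FlaggingNumpyPickler, self).save(obj, *args, **kwargs)

After the dump, the pickler's saw_array flag could then be written out wherever we decide to store it.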

GaelVaroquaux avatar Jan 09 '17 15:01 GaelVaroquaux

In our NumpyPickler, we can change a flag (on the pickler for instance) when we hit an array.

So if I understand correctly, at the end of the dump this flag will be written at the beginning of the pickle file? That will change the pickle format once again, and we'll have to deal with 3 pickle formats:

  • valid pickles with no numpy arrays
  • old invalid pickles with numpy arrays, introduced in 0.10
  • new invalid pickles with numpy arrays

That will be a pleasure to maintain ;)

aabadie avatar Jan 09 '17 15:01 aabadie

So if I understand correctly, at the end of the dump this flag will be written at the beginning of the pickle file?

I think that we could save it in the metadata of the memory object.

GaelVaroquaux avatar Jan 09 '17 15:01 GaelVaroquaux

I think that we could save it in the metadata of the memory object.

I mean in the metadata of the memory entry, i.e. what is currently stored in the .json file.

It's a bit of a hack, granted.

GaelVaroquaux avatar Jan 09 '17 15:01 GaelVaroquaux

I mean in the metadata of the memory entry, i.e. what is currently stored in the .json file.

Ok

At load time, what about a try/except strategy?

try:
    result = pickle.load(f)
except Exception:
    # Can fail if there's a numpy array in the pickle.
    result = joblib.load(filename)

If there's no numpy array, the load will use the C implementation of pickle; otherwise it will fail and fall back to the joblib loading mechanism.

aabadie avatar Jan 09 '17 15:01 aabadie

At load time, what about a try/except strategy?

I thought about that. The problem is that the array may be at the very end of the pickle, so you waste a lot of time loading almost all of the object before ultimately failing and trying the joblib.load strategy.

I don't think we can change the pickle format (adding the info somewhere in the file about whether an array is present in the pickle) while still:

  • keeping streamability, which was one of the features we strove for during the single-file pickle PR
  • avoiding walking the entire object to figure out whether an array is present or not

I think that we could save it in the metadata of the memory object.

We could do that (I mentioned this option in https://github.com/joblib/joblib/issues/467#issuecomment-270848836).

I find adding a use_joblib_pickling argument a bit cleaner (as mentioned in https://github.com/joblib/joblib/issues/467#issuecomment-271266252): for example, there is one less file to read at joblib.load time (metadata.json is not read when we do joblib.load currently). I guess the inconvenience of use_joblib_pickling is that the user has to make an informed decision.

lesteve avatar Jan 09 '17 17:01 lesteve

Also, in the context of scikit-learn, the assumption is that estimators should be saved/loaded with joblib. However, for instance a TfidfVectorizer object (which contains no numpy arrays) is ~50x slower to load with joblib than with Python 3 pickle.
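Roughly measured like this (a sketch; any sizeable text corpus gives a similar picture):

import pickle
import time

import joblib
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

texts = fetch_20newsgroups(subset='train').data  # downloaded on first use
vectorizer = TfidfVectorizer().fit(texts)

joblib.dump(vectorizer, '/tmp/vec.joblib')
with open('/tmp/vec.pkl', 'wb') as f:
    pickle.dump(vectorizer, f, protocol=pickle.HIGHEST_PROTOCOL)

start = time.time()
joblib.load('/tmp/vec.joblib')
joblib_time = time.time() - start

start = time.time()
with open('/tmp/vec.pkl', 'rb') as f:
    pickle.load(f)
pickle_time = time.time() - start

print('joblib: %.2fs, pickle: %.2fs' % (joblib_time, pickle_time))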

rth avatar Apr 06 '17 13:04 rth

just to add another use case for this: I'm trying out using joblib to cache sets of compiled Theano functions so I don't have to constantly re-build them when I'm developing and restarting things frequently. I previously rolled my own hash/pickle caching, but using joblib is much more elegant. The function sets end up being 200-250 MB uncompressed, and loading/storing them is about 50-75% slower than when using pickle directly (this is on Python 3.6, where pickle == cPickle).

ssfrr avatar Jul 10 '17 22:07 ssfrr

whoops, correction: it seems that when I turn compression off the speeds are actually pretty comparable, but the size on disk seems about twice as big as when I use pickle directly (using HIGHEST_PROTOCOL). Sorry for the misleading info.

ssfrr avatar Jul 10 '17 22:07 ssfrr

Did you try using lzma? I seem to remember that it's a pretty fast compression algorithm.

GaelVaroquaux avatar Jul 10 '17 22:07 GaelVaroquaux

I seem to remember that it's a pretty fast compression algorithm.

It depends on the data, but generally it compresses better at the cost of slower speed.

aabadie avatar Jul 11 '17 06:07 aabadie

It depends on the data, but generally it compresses better at the cost of slower speed.

I've already made this mistake: I keep confusing it with another one (lz4, I believe). Maybe it would be useful to add hints in the joblib docs about which is fast and which compresses better.
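For reference, joblib.dump already accepts a (compressor, level) tuple, so the docs could contrast them side by side:

import joblib

obj = [list(range(100)) for _ in range(100000)]

# zlib at a low level: fast, moderate compression.
joblib.dump(obj, '/tmp/obj.pkl.z', compress=('zlib', 1))

# xz (lzma) at a higher level: slower, but usually smaller files.
joblib.dump(obj, '/tmp/obj.pkl.xz', compress=('xz', 3))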

GaelVaroquaux avatar Jul 11 '17 11:07 GaelVaroquaux

So, is there any hope of getting memory.cache to use normal pickle? It seems to at least double my load time.

ghost avatar Jan 09 '18 15:01 ghost

Here is a workaround. It may be a bit brittle, meaning it might break with the next joblib release, but maybe it helps in your case.

import os
import pickle
import time

import joblib
from joblib import memory
from joblib.logger import format_time  # needed by the verbose branch below


def my_load_output(output_dir, func_name, timestamp=None, metadata=None,
                   mmap_mode=None, verbose=0):
    """Load output of a computation."""
    if verbose > 1:
        signature = ""
        try:
            if metadata is not None:
                args = ", ".join(['%s=%s' % (name, value)
                                  for name, value
                                  in metadata['input_args'].items()])
                signature = "%s(%s)" % (os.path.basename(func_name), args)
            else:
                signature = os.path.basename(func_name)
        except KeyError:
            pass

        if timestamp is not None:
            t = "% 16s" % format_time(time.time() - timestamp)
        else:
            t = ""

        if verbose < 10:
            print('[Memory]%s: Loading %s...' % (t, str(signature)))
        else:
            print('[Memory]%s: Loading %s from %s' % (
                    t, str(signature), output_dir))

    filename = os.path.join(output_dir, 'output.pkl')
    if not os.path.isfile(filename):
        raise KeyError(
            "Non-existing cache value (may have been cleared).\n"
            "File %s does not exist" % filename)
    with open(filename, 'rb') as f:
        result = pickle.load(f)

    return result

original_load_output = memory._load_output
memory._load_output = my_load_output

mem = joblib.Memory('/tmp/test')

def identity(x):
    return x

cached_identity = mem.cache(identity)
cached_identity(3)

Maybe we could make it easier to override the dump and load functions in derived classes of Memory. I'm not entirely sure whether there is hope of doing it in a cleaner way inside joblib. See https://github.com/joblib/joblib/issues/467#issuecomment-271345741 for what I think is a decent summary. You may be able to find more context if you search through the issues.

lesteve avatar Jan 09 '18 16:01 lesteve