
NumPy Optimizations and joblib Comparison

nournia opened this issue on Mar 27, 2016 · 6 comments

Hi, I'm currently using joblib for caching numpy array objects. Is there any benchmark on these kinds of inputs for DiskCache?

nournia avatar Mar 27 '16 06:03 nournia

Sorry, I don't have a benchmark. I looked at joblib but it didn't quite fit my needs. From what I see in the source, joblib uses a handful of tricks and tweaks to improve its handling of numpy arrays.

Could you describe your use case? I'm interested in constructing a benchmark if possible.

DiskCache lacks a memoizing decorator like joblib's, but it's easy enough to write one. Here's the simplest benchmark I could think of (using IPython):

In [6]: %paste
import functools

import diskcache

class Cache(diskcache.Cache):
    def memoize(self, func):
        @functools.wraps(func)
        def wrapper(*args):
            # Use the positional arguments directly as the cache key.
            try:
                return self[args]
            except KeyError:
                value = func(*args)
                self[args] = value
                return value
        return wrapper

cache = Cache('/tmp/diskcache')

@cache.memoize
def identity1(value):
    print('identity1', value)
    return value

%timeit -n1 -r50 identity1(0)

import joblib

memory = joblib.Memory('/tmp/joblib')

@memory.cache
def identity2(value):
    print('identity2', value)
    return value

%timeit -n1 -r50 identity2(0)
## -- End pasted text --
1 loop, best of 50: 16.9 µs per loop
1 loop, best of 50: 832 µs per loop

For the simple identity function above, DiskCache is about fifty times faster: the best of fifty runs took 16.9 microseconds for the DiskCache lookup versus 832 microseconds for joblib.

grantjenks avatar Mar 28 '16 04:03 grantjenks

That is just great. Here is what I get on my laptop (SSD):

%timeit -n1 -r50 identity_diskcache(10)  # 1 loops, best of 50: 67 µs per loop
%timeit -n1 -r50 identity_joblib(10)  # 1 loops, best of 50: 217 µs per loop

import numpy as np
random = np.random.random(5000)
%timeit -n1 -r50 identity_diskcache(random)  # 1 loops, best of 50: 257 µs per loop
%timeit -n1 -r50 identity_joblib(random)  # 1 loops, best of 50: 770 µs per loop

You are also faster at retrieving numpy arrays.

nournia avatar Mar 28 '16 08:03 nournia

Note that the scales will tip back for sufficiently large numpy arrays:

In [1]: %paste
values = np.random.random(int(1e6))
%timeit -n1 -r50 identity1(values)
%timeit -n1 -r50 identity2(values)
## -- End pasted text --
1 loop, best of 50: 42.1 ms per loop
1 loop, best of 50: 12.9 ms per loop

Now the numpy-aware joblib has an advantage.

Making DiskCache faster is then a matter of using the optimized serialization routines that numpy provides. DiskCache has a separate serialization class that handles converting to/from the database and filesystem. You can read about it at:

  • http://www.grantjenks.com/docs/diskcache/tutorial.html#disk
  • http://www.grantjenks.com/docs/diskcache/api.html#disk

I would be glad to accept pull requests that create a numpy-aware diskcache.Disk-like serializer.
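
As a rough, untested sketch of the idea: subclass diskcache.Disk and serialize arrays with numpy's own binary format instead of pickle. The store/fetch signatures here follow the Disk examples in the tutorial linked above and may differ between versions.

import io

import diskcache
import numpy as np

class NumpyDisk(diskcache.Disk):
    """Serialize numpy arrays with np.save/np.load; defer
    everything else to the default pickle-based Disk."""

    def store(self, value, read, key=None):
        if not read and isinstance(value, np.ndarray):
            buffer = io.BytesIO()
            np.save(buffer, value, allow_pickle=False)
            value = buffer.getvalue()  # stored as raw bytes
        return super(NumpyDisk, self).store(value, read)

    def fetch(self, mode, filename, value, read):
        data = super(NumpyDisk, self).fetch(mode, filename, value, read)
        # b'\x93NUMPY' is the .npy magic string; a real implementation
        # would tag arrays explicitly rather than sniff for the magic.
        if not read and isinstance(data, bytes) and data[:6] == b'\x93NUMPY':
            data = np.load(io.BytesIO(data), allow_pickle=False)
        return data

cache = diskcache.Cache('/tmp/diskcache-numpy', disk=NumpyDisk)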

grantjenks avatar Mar 28 '16 17:03 grantjenks

Just as a side note: the fact that joblib is only usable as a decorator is actually a serious limitation IMHO in some cases. Take for instance this use case: I have a function that takes a file path and does some expensive computation on it. The file is large, so I read it in chunks inside my function. I could compute a cache key from that stream and cache the results myself, but with a decorator-only approach this is not possible: I cannot create a sub-function whose argument is the whole file content so that joblib can compute the cache key for me. @ogrisel @GaelVaroquaux Is this a fair statement with respect to joblib's capabilities?
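
For reference, DiskCache's plain mapping interface supports exactly this kind of manual keying. A hypothetical sketch (do_expensive_work and the cache path are made up):

import hashlib

import diskcache

cache = diskcache.Cache('/tmp/diskcache')

def expensive_computation(path):
    # Key the cache on a digest of the file contents, computed in
    # chunks so the whole file never sits in memory at once.
    digest = hashlib.sha256()
    with open(path, 'rb') as reader:
        for chunk in iter(lambda: reader.read(2 ** 20), b''):
            digest.update(chunk)
    key = ('expensive_computation', digest.hexdigest())
    try:
        return cache[key]
    except KeyError:
        result = do_expensive_work(path)  # hypothetical placeholder
        cache[key] = result
        return result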

pombredanne avatar Jun 01 '16 08:06 pombredanne

I wonder if the performance difference for large numpy arrays is due to compression. Currently DiskCache does no compression of pickled objects. Depending on disk performance, I could imagine compression improving the performance of serializing large numpy arrays.
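
A quick way to probe that hypothesis in IPython (a sketch; level 1 is zlib's fastest setting):

import pickle
import zlib

import numpy as np

values = np.random.random(int(1e6))
data = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)

%timeit pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)
%timeit zlib.compress(data, 1)  # compression cost on top of pickling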

grantjenks avatar Sep 12 '16 05:09 grantjenks

Nope, it is not compression. Local benchmarking showed compression was ten times slower.

Now I think the difference is the cryptographic hashing of inputs that joblib performs to build its cache keys. For the identity function benchmark above in particular, that hashing has a significant impact.
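
That overhead is easy to measure in isolation, since joblib exposes its argument hashing as joblib.hash:

import joblib
import numpy as np

%timeit joblib.hash(0)                       # per-call key-hashing cost
%timeit joblib.hash(np.random.random(5000))  # grows with the input size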

grantjenks avatar Sep 12 '16 18:09 grantjenks