NumPy Optimizations and joblib Comparison
Hi, I'm currently using joblib for caching numpy array objects. Is there a benchmark for DiskCache on this kind of input?
Sorry, I don't have a benchmark. I looked at joblib but it didn't quite fit my needs. From what I see in the source, joblib uses a handful of tricks and tweaks to improve its handling of numpy arrays.
Could you describe your use case? I'm interested in constructing a benchmark if possible.
DiskCache lacks a memoizing decorator like joblib's, but one is easy enough to write. Here's the simplest benchmark I could think of (using IPython):
In [6]: %paste
import diskcache
class Cache(diskcache.Cache):
    def memoize(self, func):
        def wrapper(*args):
            try:
                return self[args]
            except KeyError:
                value = func(*args)
                self[args] = value
                return value
        return wrapper
cache = Cache('/tmp/diskcache')
@cache.memoize
def identity1(value):
    print('identity1', value)
    return value
%timeit -n1 -r50 identity1(0)
import joblib
memory = joblib.Memory('/tmp/joblib')
@memory.cache
def identity2(value):
    print('identity2', value)
    return value
%timeit -n1 -r50 identity2(0)
## -- End pasted text --
1 loop, best of 50: 16.9 µs per loop
1 loop, best of 50: 832 µs per loop
For the simple identity function above, DiskCache is about 50 times faster. Out of fifty iterations, the fastest lookup took 16.9 microseconds while joblib took 832 microseconds.
That is just great. Here is what I get on my laptop (SSD):
%timeit -n1 -r50 identity_diskcache(10) # 1 loops, best of 50: 67 µs per loop
%timeit -n1 -r50 identity_joblib(10) # 1 loops, best of 50: 217 µs per loop
import numpy as np
random = np.random.random(5000)
%timeit -n1 -r50 identity_diskcache(random) # 1 loops, best of 50: 257 µs per loop
%timeit -n1 -r50 identity_joblib(random) # 1 loops, best of 50: 770 µs per loop
DiskCache is also faster at retrieving numpy arrays.
Note that the scales will tip back for sufficiently large numpy arrays:
In [1]: %paste
values = np.random.random(int(1e6))
%timeit -n1 -r50 identity1(values)
%timeit -n1 -r50 identity2(values)
## -- End pasted text --
1 loop, best of 50: 42.1 ms per loop
1 loop, best of 50: 12.9 ms per loop
Now the numpy-aware joblib has an advantage.
Making DiskCache faster is then a matter of using the optimized serialization routines that numpy provides. DiskCache has a separate serialization class that handles converting to/from the database and filesystem. You can read about it at:
- http://www.grantjenks.com/docs/diskcache/tutorial.html#disk
- http://www.grantjenks.com/docs/diskcache/api.html#disk
I would be glad to accept pull requests that create a numpy-aware diskcache.Disk-like serializer.
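For reference, here's a minimal sketch of what such a serializer could look like. It assumes every cached value is a numpy array, and that the store/fetch signatures follow the Disk interface described in the tutorial above (check them against your installed version); NumpyDisk and the directory name are made up for illustration:

import io
import numpy as np
import diskcache

class NumpyDisk(diskcache.Disk):
    # Serialize values with numpy's binary format rather than pickle.
    def store(self, value, read, **kwargs):
        if not read:
            buffer = io.BytesIO()
            np.save(buffer, value, allow_pickle=False)
            value = buffer.getvalue()
        return super().store(value, read, **kwargs)

    def fetch(self, mode, filename, value, read):
        data = super().fetch(mode, filename, value, read)
        if not read:
            data = np.load(io.BytesIO(data), allow_pickle=False)
        return data

cache = diskcache.Cache('/tmp/diskcache-numpy', disk=NumpyDisk)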
Just as a side note: the fact that joblib is only usable as a decorator is, IMHO, a serious limitation in some cases. Take this use case: I have a function that takes a file path and performs an expensive computation on it. The file is large, so my function reads it in chunks. I could compute a cache key from that stream inside my function and cache the result there. But with a decorator-only approach this is not possible: I cannot create a sub-function that takes the whole file content as an argument so that joblib can compute the cache key for me. @ogrisel @GaelVaroquaux Is this a fair statement with respect to joblib's capabilities?
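For what it's worth, DiskCache's mapping-style API supports exactly that pattern. A rough sketch, where do_computation is a hypothetical placeholder for the expensive step:

import hashlib
import diskcache

cache = diskcache.Cache('/tmp/diskcache')

def expensive(path):
    # Key on a digest of the file content, computed in chunks so the
    # whole file never has to be held in memory at once.
    digest = hashlib.sha1()
    with open(path, 'rb') as reader:
        for chunk in iter(lambda: reader.read(1 << 20), b''):
            digest.update(chunk)
    key = ('expensive', digest.hexdigest())
    try:
        return cache[key]
    except KeyError:
        result = do_computation(path)  # hypothetical expensive step
        cache[key] = result
        return result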
I wonder if the performance difference for large numpy arrays is due to compression. Currently DiskCache does no compression of pickled objects. Depending on disk performance, I could imagine compression improving the performance of serializing large numpy arrays.
Nope, it is not compression. Local benchmarking showed compression was ten times slower.
Now I think the difference comes from joblib's cryptographic hashing of inputs. For the identity function benchmark above in particular, that overhead has a significant impact.
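One way to check is to time the input hashing in isolation. This sketch uses joblib's public joblib.hash helper, which, as far as I can tell, is the same routine Memory uses to key arguments:

import numpy as np
import joblib

arr = np.random.random(int(1e6))
%timeit joblib.hash(arr)  # cost of hashing a large array argument
%timeit joblib.hash(0)    # even scalar arguments pay a fixed hashing cost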