
question: is there a way to do in-place arithmetic operations?

Open suntzu86 opened this issue 8 years ago • 4 comments

At the moment, I don't see a great way to apply a transform to a carray besides re-allocating all of the data, like:

my_ctable['f0'] = my_ctable.eval('f0 + 1', cparams=my_ctable.cparams)

(If you don't pass cparams, eval will return a column with bcolz's default cparams; the same goes for the other kwargs to ctable's init.)

For more complex transforms (logarithms, transcendental functions, etc.) the difference is smaller, but for simple operations the above is much slower than, say, my_numpy_ndarray += 1. Plus, if I'm applying some kind of transform to every column, that's worst-case duplicating all the data, which is a lot (hopefully the GC cleans up before too long).

Is there a better way?

If not, any chance of adding this feature? I can see this being difficult or even impossible since the data is likely stored compressed, but maybe there's a way? Or it could be supported only for uncompressed (clevel=0) carrays, which feels yucky.
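For instance, transforming every column currently looks something like this sketch (each assignment builds a brand-new compressed column before the old one can be collected):

for name in my_ctable.names:
    my_ctable[name] = my_ctable.eval('%s + 1' % name,
                                     cparams=my_ctable.cparams)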

suntzu86 avatar Jun 16 '16 05:06 suntzu86

Well, I have added the capability (see 24c4d5c) to preserve the cparams during my_ctable['f0'] = x + 1, so you don't strictly need to use .eval() (if you already have an existing x ndarray). This allows for better speed (~1.5x). If you have better suggestions, I'd be happy to discuss them.
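A minimal sketch of what I mean (assuming the column fits comfortably in memory as an ndarray):

import bcolz

N = 1000
ct = bcolz.fromiter(((i, i * 2.) for i in range(N)), dtype='i4,f8', count=N)
x = ct['f0'][:]    # materialize the column as a plain numpy ndarray
ct['f0'] = x + 1   # this assignment now preserves the column's cparams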

FrancescAlted avatar Jun 16 '16 12:06 FrancescAlted

Cool! Yay speedups :) I don't think I understand how carray works under the hood enough to offer much intelligence here :/

I was also asking about the case where x is a carray. It seems that, at worst, adding two carrays (or applying any arithmetic op), x + y, means:

  • decompress x, decompress y
  • add
  • recompress the result (using what cparams?)

That is probably very expensive. Even when calling costlier functions like log it matters, and for something as cheap as addition the decompress/recompress would easily dominate the runtime.

And if I wanted to do x + y + z, that would compound the decompress/recompress. So maybe...

  • handle this case but only if compression level = 0
    • even here it's not obvious: maybe it's most natural to let numpy do the math, but for GBs of data that's a lot of copying back and forth between bcolz and numpy
  • move compression into a more inner loop and chunk it, so that we decompress some portion of x, y, z, do the ops, and then recompress only the final result (all intermediates stay uncompressed); see the sketch after this list
    • maybe do it lazily, so ops don't evaluate right away but only when the user asks for the result? Then z = x + y followed by z += 2 becomes z = x + y + 2 internally, evaluated only when required. That would probably want an expression tree or a parser that knows how to fold constants, eliminate common subexpressions, etc., which is starting to sound big and complex.
  • or is there a compression scheme that is amenable to arithmetic operations? (like somehow I can add, multiply, etc. without decompress/recompress)
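To make the second idea a bit more concrete, here is a rough sketch of a hypothetical blockwise_binop helper (I'm assuming bcolz.iterblocks can walk both carrays in equal-sized blocks; the helper itself is not existing bcolz API):

import numpy as np
import bcolz

def blockwise_binop(x, y, op, blen=None):
    # decompress one block of x and y at a time, apply op, and
    # compress only the result; intermediates stay as plain numpy blocks
    assert len(x) == len(y)
    blen = blen or x.chunklen
    # the output inherits x's cparams, so compression settings survive
    out = bcolz.zeros(0, dtype=np.result_type(x.dtype, y.dtype),
                      expectedlen=len(x), cparams=x.cparams)
    for bx, by in zip(bcolz.iterblocks(x, blen=blen),
                      bcolz.iterblocks(y, blen=blen)):
        out.append(op(bx, by))   # recompress only the final result
    return out

# z = blockwise_binop(x, y, np.add)   # i.e. z = x + y, block by block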

suntzu86 avatar Jun 20 '16 21:06 suntzu86

Basically, like I said, I don't think I have any good ideas :( Maybe you see a path through it?

The use case is, for example, (down)loading data, doing some transforms, and then sending it off to some ML tool for consumption.

suntzu86 avatar Jun 20 '16 21:06 suntzu86

Here you are supposing that compression/decompression is a very expensive operation, and it is not, at least on modern CPUs with several cores. Also, bcolz is highly optimized to perform operations block-wise for better cache usage. For example, with a CPU with 4 physical cores I am getting this:

In [1]: import numpy as np

In [2]: import bcolz

In [3]: N = int(1e7); ct = bcolz.fromiter(((i, i * 3, i * 2.) for i in xrange(N)), dtype='i4,f4,f8', count=N)

In [4]: ct
Out[4]: 
ctable((10000000,), [('f0', '<i4'), ('f1', '<f4'), ('f2', '<f8')])
  nbytes: 152.59 MB; cbytes: 6.33 MB; ratio: 24.11
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
[(0, 0.0, 0.0) (1, 3.0, 2.0) (2, 6.0, 4.0) ...,
 (9999997, 29999992.0, 19999994.0) (9999998, 29999994.0, 19999996.0)
 (9999999, 29999996.0, 19999998.0)]

In [5]: %timeit ct.eval('f0 + f1 + f2')
10 loops, best of 3: 55.5 ms per loop

# Using pure numpy here:
In [6]: ra = ct[:]

In [7]: ra
Out[7]: 
array([(0, 0.0, 0.0), (1, 3.0, 2.0), (2, 6.0, 4.0), ...,
       (9999997, 29999992.0, 19999994.0),
       (9999998, 29999994.0, 19999996.0), (9999999, 29999996.0, 19999998.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<f8')])

In [8]: %timeit ra['f0'] + ra['f1'] + ra['f2']
10 loops, best of 3: 72.1 ms per loop

Doing the same operation without compression is actually slower:

In [10]: bcolz.defaults.cparams['clevel'] = 0

In [11]: N = int(1e7); ct = bcolz.fromiter(((i, i * 3, i * 2.) for i in xrange(N)), dtype='i4,f4,f8', count=N)

In [12]: ct
Out[12]: 
ctable((10000000,), [('f0', '<i4'), ('f1', '<f4'), ('f2', '<f8')])
  nbytes: 152.59 MB; cbytes: 153.50 MB; ratio: 0.99
  cparams := cparams(clevel=0, shuffle=1, cname='lz4', quantize=0)
[(0, 0.0, 0.0) (1, 3.0, 2.0) (2, 6.0, 4.0) ...,
 (9999997, 29999992.0, 19999994.0) (9999998, 29999994.0, 19999996.0)
 (9999999, 29999996.0, 19999998.0)]

In [13]: %timeit ct.eval('f0 + f1 + f2')
10 loops, best of 3: 65.2 ms per loop

Regarding lazy operations, yes, depending on your needs that could be a win. Dask already does this, so I am not sure we should replicate it in bcolz. bcolz is more meant to let the user specify the complete set of operations in a single expression and then execute it in one go (which is usually very efficient).
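For instance, your z = x + y followed by z += 2 case can be folded into a single expression and evaluated block-wise in one pass (a sketch, reusing the ct from the session above):

z = ct.eval('f0 + f1 + 2', cparams=ct.cparams)   # one complete expression, one pass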

FrancescAlted avatar Jun 21 '16 07:06 FrancescAlted