Efficient bcolz.ctable itersequence/read_coordinates methods
It looks like bcolz.ctable[list_of_indices] is very slow:
Python 3.5.1 |Anaconda custom (64-bit)| (default, Dec 7 2015, 11:16:01)
In [1]: %paste
In [1]: import numpy as np
In [2]: import bcolz
In [3]: from collections import namedtuple
In [4]: from itertools import islice
In [5]: bcolz.print_versions()
## -- End pasted text --
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version: 1.1.1.dev16
NumPy version: 1.11.1
Blosc version: 1.9.3 ($Date:: 2016-07-06 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version: 2.6.1
Dask version: 0.11.0
Python version: 3.5.1 |Anaconda custom (64-bit)| (default, Dec 7 2015, 11:16:01)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Platform: linux-x86_64
Byte-ordering: little
Detected cores: 4
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
In [2]: %paste
In [6]: N = int(1e6)
In [7]: ct = bcolz.fromiter(((i,i*i) for i in range(N)), dtype="i4,f8", count=N, rootdir='bz_test', mode='w')
In [8]: from random import randint, sample
In [9]: list_of_indx = sample(range(1000000), k=100000)
In [10]: %time ct[list_of_indx]
## -- End pasted text --
CPU times: user 8.56 s, sys: 1.44 s, total: 10 s
Wall time: 10 s
Out[2]:
array([(807054, 651336158916.0), (882754, 779254624516.0),
(918705, 844018877025.0), ..., (799379, 639006785641.0),
(620003, 384403720009.0), (42781, 1830213961.0)],
dtype=[('f0', '<i4'), ('f1', '<f8')])
I wonder if it would be possible for bcolz to have itersequence/read_coordinates methods that are as efficient as their PyTables counterparts?
Thank you, Anton
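For reference, this is the PyTables API I mean (a minimal sketch; the file name is just illustrative, matching the 'h5_test.h5' file created later in this thread):

import tables

# The two PyTables Table methods in question:
h5_table = tables.open_file('h5_test.h5').root.test
rows = h5_table.read_coordinates(indx)    # gather rows at arbitrary positions
for row in h5_table.itersequence(indx):   # or iterate over the same positions
    print(row['f0'], row['f1'])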
Well, everything can be made faster, but how much faster is PyTables in this situation?
Right, here is a benchmark against PyTables.
In [1]: import tables
In [2]: tables.__version__
Out[2]: '3.2.4.dev0'
In [3]: import bcolz
In [4]: bcolz.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version: 1.1.1.dev18+dirty
NumPy version: 1.11.1
Blosc version: 1.9.3 ($Date:: 2016-07-06 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version: 2.6.1
Dask version: 0.11.0
Python version: 3.5.1 |Anaconda custom (64-bit)| (default, Dec 7 2015, 11:16:01)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Platform: linux-x86_64
Byte-ordering: little
Detected cores: 4
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
In [5]: N = int(1e6)
In [6]: ct = bcolz.fromiter(((i,i*i) for i in range(N)), dtype="i4,f8", count=N, rootdir='bz_test', mode='w')
In [7]: from random import sample
In [8]: indx = sample(range(1000000), k=100000)
In [9]: %time ct[indx]
CPU times: user 8.85 s, sys: 1.52 s, total: 10.4 s
Wall time: 10.4 s
Out[9]:
array([(33992, 1155456064.0), (274165, 75166447225.0),
(723780, 523857488400.0), ..., (186771, 34883406441.0),
(488874, 238997787876.0), (618955, 383105292025.0)],
dtype=[('f0', '<i4'), ('f1', '<f8')])
In [10]: ct.tohdf5('h5_test.h5', '/test', mode='w')
In [11]: h5_table = tables.open_file('h5_test.h5').root.test
In [12]: h5_table
Out[12]:
/test (Table(1000000,), shuffle, blosc:lz4(5)) ''
description := {
"f0": Int32Col(shape=(), dflt=0, pos=0),
"f1": Float64Col(shape=(), dflt=0.0, pos=1)}
byteorder := 'little'
chunkshape := (10922,)
In [13]: %time h5_table.read_coordinates(indx)
CPU times: user 1.09 s, sys: 16 ms, total: 1.11 s
Wall time: 1.11 s
Oh yes, that's a big difference, and it should be fixed. Pull requests are welcome!
It looks like reading random coordinates of a bcolz.ctable comes down to reading random coordinates of its carray columns. And from http://bcolz.readthedocs.io/en/latest/tutorial.html: 'Keep in mind that carrays are great for sequential access, but random access will highly likely trigger decompression of a different chunk for each randomly accessed value.'
Do you think that could be the issue in this case?
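As a rough sanity check on that hypothesis (a sketch using the real carray.chunklen attribute; the exact numbers will vary with the chunk size chosen):

# Rough chunk-cache arithmetic for this dataset:
chunklen = ct['f0'].chunklen          # rows per compressed chunk in one column
n_chunks = -(-len(ct) // chunklen)    # ceiling division
# With 100,000 single-row reads scattered across n_chunks chunks, two
# consecutive reads rarely land in the same chunk, so the one-chunk read
# cache almost always misses and each read pays a full decompression.
print(chunklen, n_chunks)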
Not really because PyTables also has to deal with chunked data, so the bottleneck must be in some other place. Doing a profile would be helpful.
bcolz.ctable reading random coordinates
In [13]: import cProfile
In [14]: cProfile.run('print (ct[indx])')
[(276899, 76673056201.0) (307470, 94537800900.0) (466150, 217295822500.0)
..., (33871, 1147244641.0) (292008, 85268672064.0) (238873, 57060310129.0)]
8059373 function calls (7959361 primitive calls) in 13.498 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:996(_handle_fromlist)
1 0.000 0.000 13.498 13.498 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 arrayprint.py:208(_leading_trailing)
6 0.000 0.000 0.000 0.000 arrayprint.py:234(repr_format)
1 0.000 0.000 0.000 0.000 arrayprint.py:237(_array2string)
6 0.000 0.000 0.000 0.000 arrayprint.py:328(_convert_arrays)
7/1 0.000 0.000 0.000 0.000 arrayprint.py:340(array2string)
7 0.000 0.000 0.000 0.000 arrayprint.py:450(_extendLine)
1 0.000 0.000 0.000 0.000 arrayprint.py:458(_formatArray)
3 0.000 0.000 0.000 0.000 arrayprint.py:529(__init__)
3 0.000 0.000 0.000 0.000 arrayprint.py:543(fillFormat)
1 0.000 0.000 0.000 0.000 arrayprint.py:635(__init__)
3 0.000 0.000 0.000 0.000 arrayprint.py:657(__init__)
1 0.000 0.000 0.000 0.000 arrayprint.py:685(__init__)
1 0.000 0.000 0.000 0.000 arrayprint.py:696(__init__)
1 0.000 0.000 0.000 0.000 arrayprint.py:713(__init__)
1 0.000 0.000 0.000 0.000 arrayprint.py:734(__init__)
200004 0.093 0.000 0.255 0.000 carray_ext.pyx:1020(__get__)
200004 0.113 0.000 0.162 0.000 carray_ext.pyx:1030(__get__)
200000 0.175 0.000 10.414 0.000 carray_ext.pyx:1813(getitem_cache)
200000 0.507 0.000 11.001 0.000 carray_ext.pyx:1900(__getitem__)
186696 0.080 0.000 0.080 0.000 carray_ext.pyx:345(__cinit__)
195252 2.462 0.000 2.462 0.000 carray_ext.pyx:514(_getitem)
746784 0.065 0.000 0.065 0.000 carray_ext.pyx:649(decode_byte)
560088 0.206 0.000 0.206 0.000 carray_ext.pyx:654(decode_uint32)
186696 0.534 0.000 0.805 0.000 carray_ext.pyx:658(decode_blosc_header)
186696 0.021 0.000 0.021 0.000 carray_ext.pyx:710(__get__)
186696 0.192 0.000 0.882 0.000 carray_ext.pyx:717(__get__)
186696 0.415 0.000 1.659 0.000 carray_ext.pyx:759(_chunk_file_name)
186696 4.070 0.000 7.367 0.000 carray_ext.pyx:763(read_chunk)
196458 0.330 0.000 7.777 0.000 carray_ext.pyx:793(__getitem__)
400004 0.058 0.000 0.058 0.000 carray_ext.pyx:987(__get__)
400004 0.100 0.000 0.100 0.000 carray_ext.pyx:992(__get__)
100001/1 0.648 0.000 13.498 13.498 ctable.py:1194(__getitem__)
100000 0.334 0.000 11.388 0.000 ctable.py:1224(<listcomp>)
2 0.000 0.000 0.000 0.000 ctable.py:1240(<genexpr>)
100001 0.178 0.000 13.356 0.000 ctable.py:1251(<genexpr>)
100002 0.543 0.000 0.952 0.000 ctable.py:189(dtype)
200002 0.055 0.000 0.055 0.000 ctable.py:202(names)
400004 0.095 0.000 0.095 0.000 ctable.py:83(__getitem__)
186696 0.132 0.000 0.833 0.000 genericpath.py:16(exists)
7/1 0.000 0.000 0.000 0.000 numeric.py:1835(array_str)
6 0.000 0.000 0.000 0.000 numeric.py:2576(seterr)
6 0.000 0.000 0.000 0.000 numeric.py:2676(geterr)
3 0.000 0.000 0.000 0.000 numeric.py:2963(__init__)
3 0.000 0.000 0.000 0.000 numeric.py:2967(__enter__)
3 0.000 0.000 0.000 0.000 numeric.py:2972(__exit__)
373392 0.159 0.000 0.264 0.000 posixpath.py:39(_get_sep)
373392 0.562 0.000 1.031 0.000 posixpath.py:71(join)
1 0.000 0.000 0.000 0.000 {built-in method _functools.reduce}
1 0.000 0.000 0.000 0.000 {built-in method builtins.all}
1 0.000 0.000 13.498 13.498 {built-in method builtins.exec}
20 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
473425 0.150 0.000 0.150 0.000 {built-in method builtins.isinstance}
6 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
21 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
6 0.000 0.000 0.000 0.000 {built-in method builtins.repr}
1 0.014 0.014 0.014 0.014 {built-in method numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.concatenate}
1 0.127 0.127 13.483 13.483 {built-in method numpy.core.multiarray.fromiter}
12 0.000 0.000 0.000 0.000 {built-in method numpy.core.umath.geterrobj}
6 0.000 0.000 0.000 0.000 {built-in method numpy.core.umath.seterrobj}
186696 0.701 0.000 0.701 0.000 {built-in method posix.stat}
200016 0.049 0.000 0.049 0.000 {method 'append' of 'list' objects}
100000 0.122 0.000 0.122 0.000 {method 'copy' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
373392 0.077 0.000 0.077 0.000 {method 'endswith' of 'str' objects}
6 0.000 0.000 0.000 0.000 {method 'item' of 'numpy.ndarray' objects}
3 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
15 0.000 0.000 0.000 0.000 {method 'rstrip' of 'str' objects}
373392 0.128 0.000 0.128 0.000 {method 'startswith' of 'str' objects}
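Reading that profile, read_chunk dominates (4.07 s tottime, 7.37 s cumtime out of 13.5 s total), which fits the chunk-miss picture (a rough interpretation of the numbers above):

# 100,000 indices x 2 columns = 200,000 single-row reads, which triggered
# 186,696 read_chunk calls, i.e. 186,696 chunk decompressions:
print(186696 / 200000)   # ~0.93, so the chunk cache misses ~93% of the time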
PyTables read_coordinates
In [18]: cProfile.run('print (h5_table.read_coordinates(indx))')
[(276899, 76673056201.0) (307470, 94537800900.0) (466150, 217295822500.0)
..., (33871, 1147244641.0) (292008, 85268672064.0) (238873, 57060310129.0)]
263 function calls (251 primitive calls) in 1.087 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:996(_handle_fromlist)
1 0.000 0.000 1.087 1.087 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 _methods.py:37(_any)
1 0.000 0.000 0.000 0.000 arrayprint.py:208(_leading_trailing)
6 0.000 0.000 0.000 0.000 arrayprint.py:234(repr_format)
1 0.000 0.000 0.000 0.000 arrayprint.py:237(_array2string)
6 0.000 0.000 0.000 0.000 arrayprint.py:328(_convert_arrays)
7/1 0.000 0.000 0.000 0.000 arrayprint.py:340(array2string)
7 0.000 0.000 0.000 0.000 arrayprint.py:450(_extendLine)
1 0.000 0.000 0.000 0.000 arrayprint.py:458(_formatArray)
3 0.000 0.000 0.000 0.000 arrayprint.py:529(__init__)
3 0.000 0.000 0.000 0.000 arrayprint.py:543(fillFormat)
1 0.000 0.000 0.000 0.000 arrayprint.py:635(__init__)
3 0.000 0.000 0.000 0.000 arrayprint.py:657(__init__)
1 0.000 0.000 0.000 0.000 arrayprint.py:685(__init__)
1 0.000 0.000 0.000 0.000 arrayprint.py:696(__init__)
1 0.000 0.000 0.000 0.000 arrayprint.py:713(__init__)
1 0.000 0.000 0.000 0.000 arrayprint.py:734(__init__)
1 0.000 0.000 0.000 0.000 flavor.py:113(array_of_flavor2)
1 0.000 0.000 0.000 0.000 flavor.py:136(flavor_to_flavor)
1 0.000 0.000 0.000 0.000 flavor.py:158(internal_to_flavor)
1 0.000 0.000 0.000 0.000 flavor.py:368(conv_to_numpy)
1 0.000 0.000 0.000 0.000 flavor.py:380(_conv_numpy_to_numpy)
2 0.000 0.000 0.000 0.000 fromnumeric.py:1892(any)
1 0.000 0.000 0.000 0.000 leaf.py:203(flavor)
1 0.001 0.001 0.013 0.013 leaf.py:497(_point_selection)
1 0.000 0.000 0.000 0.000 node.py:336(_g_check_open)
7/1 0.000 0.000 0.000 0.000 numeric.py:1835(array_str)
6 0.000 0.000 0.000 0.000 numeric.py:2576(seterr)
6 0.000 0.000 0.000 0.000 numeric.py:2676(geterr)
3 0.000 0.000 0.000 0.000 numeric.py:2963(__init__)
3 0.000 0.000 0.000 0.000 numeric.py:2967(__enter__)
3 0.000 0.000 0.000 0.000 numeric.py:2972(__exit__)
2 0.000 0.000 0.000 0.000 numeric.py:414(asarray)
2 0.000 0.000 0.000 0.000 numeric.py:484(asanyarray)
1 0.000 0.000 1.078 1.078 table.py:1932(_read_coordinates)
1 0.009 0.009 1.087 1.087 table.py:1966(read_coordinates)
2 0.000 0.000 0.000 0.000 table.py:534(shape)
1 0.000 0.000 0.000 0.000 table.py:959(_get_container)
1 0.000 0.000 0.000 0.000 {built-in method _functools.reduce}
1 0.000 0.000 1.087 1.087 {built-in method builtins.exec}
21 0.000 0.000 0.000 0.000 {built-in method builtins.hasattr}
32 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
6 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
23 0.000 0.000 0.000 0.000 {built-in method builtins.len}
1 0.000 0.000 0.000 0.000 {built-in method builtins.min}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
6 0.000 0.000 0.000 0.000 {built-in method builtins.repr}
5 0.012 0.002 0.012 0.002 {built-in method numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.concatenate}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.empty}
12 0.000 0.000 0.000 0.000 {built-in method numpy.core.umath.geterrobj}
6 0.000 0.000 0.000 0.000 {built-in method numpy.core.umath.seterrobj}
1 1.065 1.065 1.065 1.065 {method '_read_elements' of 'tables.tableextension.Table' objects}
2 0.000 0.000 0.000 0.000 {method 'any' of 'numpy.ndarray' objects}
12 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
6 0.000 0.000 0.000 0.000 {method 'item' of 'numpy.ndarray' objects}
3 0.000 0.000 0.000 0.000 {method 'pop' of 'dict' objects}
3 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
15 0.000 0.000 0.000 0.000 {method 'rstrip' of 'str' objects}
Ignoring the other (apparent?) issues in that profile, I was curious about the impact of simply coalescing the chunk accesses, so I created a proof-of-concept that presorts the indices, then reshuffles before returning. It costs you double the memory (hey, it's a POC after all), but it demonstrates the performance impact of the issue.
On my computer, reading the 100,000 random indices takes ~7.90 s. Presorting and reshuffling drops that to ~415 ms.
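In user-level terms, the idea is roughly this (a minimal sketch assuming NumPy; the take_presorted helper is hypothetical and for illustration only, the actual POC does this inside bcolz):

import numpy as np

def take_presorted(ct, indices):
    idx = np.asarray(indices)
    order = np.argsort(idx)        # permutation that visits chunks in order
    gathered = ct[idx[order]]      # ascending reads: each chunk decompressed once
    out = np.empty_like(gathered)  # second copy -> the "double the memory" cost
    out[order] = gathered          # scatter back into the original request order
    return out

Sorted access helps even without patching bcolz, because consecutive sorted indices usually hit the chunk already held in getitem_cache instead of forcing a fresh decompression.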
> Created a proof-of-concept that presorts the indices, then reshuffles before returning.
I can confirm that with that patch included, the run time goes down significantly in this case.
In [4]: %paste
In [5]: N = int(1e6)
In [6]: ct = bcolz.fromiter(((i,i*i) for i in range(N)), dtype="i4,f8", count=N, rootdir='bz_test', mode='w')
In [7]: from random import sample
In [8]: indx = sample(range(1000000), k=100000)
In [9]: %time ct[indx]
## -- End pasted text --
CPU times: user 809 ms, sys: 3 ms, total: 812 ms
Wall time: 808 ms
Out[4]:
array([(428294, 183435750436.0), (37513, 1407225169.0),
(518376, 268713677376.0), ..., (663849, 440695494801.0),
(115971, 13449272841.0), (203379, 41363017641.0)],
dtype=[('f0', '<i4'), ('f1', '<f8')])