
Reading selected columns into memory from an on-disc ctable

TroubleMak3r opened this issue 9 years ago • 5 comments

Hi,

As far as I can tell from the docs, bcolz optimizes read and write operations on specific columns thanks to its columnar storage. However, I cannot find any function that would let me take advantage of that.

So far I have created an enormous ctable with bcolz, stored on disc. It has roughly 100,000 columns (or more) and 100,000 rows (I'm working on a slightly smaller subset for now). What I'm trying to achieve is to read around 10 to 100 selected columns into memory (a pandas DataFrame, to be exact).

There is a todataframe method that takes a "columns" argument, but it operates on ctable objects, and the only way I have found to open an existing ctable is the "ctable.open" function, which reads all the data into memory; that is incredibly slow and inefficient for my dataset and problem. I'm guessing I must have overlooked some function or parameter that merely points to the data stored on disc and creates some kind of 'handle' on which I could later call todataframe. That would speed things up and dramatically reduce memory consumption.

Could you tell me whether such a function exists, or do I have to write it myself?

Regards, Dominik

TroubleMak3r · Aug 30 '16 11:08

Hi Dominik,

If you have an on-disk ctable called mytable.bcolz, you would open it by passing the rootdir parameter to the ctable constructor: ct = bcolz.ctable(rootdir='mytable.bcolz'). The ct variable is a ctable object that contains the metadata for accessing the content of the on-disk ctable; by itself, opening the on-disk table will not read the entire table into memory.

If you want to create an in-memory dataframe containing entire columns from ct, you can use column-indexing (a la pandas: ct[['col1', 'col2']]) to create another ctable object, then call .todataframe() on it. For example, if ct has columns A, B, C, and D, but you only want columns B and D:

import bcolz

ct = bcolz.ctable(rootdir='mytable.bcolz')
df = ct[['B', 'D']].todataframe()  # materialize only columns B and D

Hope that helps!

pfheatwole · Aug 31 '16 21:08

Thanks for the reply!

That's exactly what I tried before asking the question:

data_frame = bcolz.ctable(rootdir='files/ctable300/')

or

data_frame = bcolz.ctable(rootdir='files/ctable300/', mode='r')

But to no avail. Memory consumption rises dramatically to 1 GB and beyond, and it is eventually unable to open the dataset for lack of RAM. Evidently it tries to read the data into memory instead of creating a 'handle' to the files on disc, behaving the same way as the "open" function: the same memory-hungry behavior.

Or maybe I'm creating the ctable the wrong way? I created it with bcolz.ctable.fromdataframe(data_frame, rootdir='files/ctable300'), since I already had the data loaded into memory.

However, I came up with a workaround: I wrote a new method based on the '.open' method that, instead of iterating over all the columns under rootdir, reads only the ones I need, passed as a list via a "columns" parameter. The code probably looks a little crude, as I'm a beginner in Python, but it works like a charm.
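Roughly, the idea looks like this (a simplified sketch, not my exact code; read_columns is just an illustrative name, and it assumes bcolz's on-disk layout, where each column lives as a carray in a subdirectory named after it under the ctable's rootdir):

import os
import bcolz
import pandas as pd

def read_columns(rootdir, columns):
    # Open only the requested columns; each one is a carray
    # stored in its own subdirectory of the ctable's rootdir.
    data = {}
    for name in columns:
        carr = bcolz.open(os.path.join(rootdir, name), mode='r')
        data[name] = carr[:]  # decompress just this one column
    return pd.DataFrame(data, columns=columns)

df = read_columns('files/ctable300', ['col1', 'col32'])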

TroubleMak3r · Sep 01 '16 10:09

Ah, I see. If I had to guess, your issue is with the lastchunk component of each carray. Whenever a carray is opened, it reads in the "leftover" portion of data that wasn't big enough to fill a complete chunk, probably for simplicity elsewhere in the code. Suppose a column with a 32 KB chunksize stored 63 KB of data; that carray would then hold 31 KB of uncompressed data in memory. Across 100,000 columns, that leftover quickly becomes an issue.
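Back-of-the-envelope, using the numbers above (rough figures, not measured):

n_columns = 100000
leftover_bytes = 31 * 1024  # per-column leftover from the 63 KB / 32 KB example
print(n_columns * leftover_bytes / 2.0**30)  # ~2.96 GiB in leftovers alone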

Maybe a bcolz developer will have a better solution, but for the moment I think you have the right idea with your workaround. One small suggestion for simplicity: reuse the "names" parameter of the ctable constructor instead of inventing an entirely new function. I threw up an example here, but I only had time for a cursory test, so use it with caution. It's minimally invasive and lets you keep a similar API: ct = bcolz.ctable(rootdir='files/ctable300', names=['col1', 'col32']) opens a ctable with only a subset of the available columns.
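In full, assuming the patched constructor from my example behaves as described, the round trip would look like:

ct = bcolz.ctable(rootdir='files/ctable300', names=['col1', 'col32'])
df = ct.todataframe()  # materializes only the two columns that were opened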

pfheatwole · Sep 01 '16 18:09

Yes, @gogoengie is probably right in that the lastchunk buffer is responsible for all the allocated memory. A possible solution would be to make lastchunk lazy and create it only when necessary. But perhaps the best option would be to use the names suggestion to load just a 'view' of the table. Pull requests on either of these options are welcome.
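Just to sketch the lazy-creation idea in generic Python (an illustration of the pattern, not actual bcolz internals):

class LazyLastchunk(object):
    def __init__(self, loader):
        self._loader = loader  # callable that reads the leftover data from disk
        self._cached = None

    @property
    def data(self):
        if self._cached is None:  # the buffer is created only on first access
            self._cached = self._loader()
        return self._cached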

FrancescAlted · Sep 01 '16 19:09

Thanks a lot for the explanation and tips guys!

TroubleMak3r · Sep 02 '16 09:09