
Merge multiple DataFrames into a ctable

Open fredfortier opened this issue 7 years ago • 1 comments

I'm trying to do something that seems intuitively simple, but I have been unable to accomplish it.

I'm trying to populate a ctable from a stream of sequential data. The data arrives as DataFrames that always have the same structure, and each DataFrame uses a DatetimeIndex.

Initializing a ctable from the first data chunk is simple: bcolz.ctable.fromdataframe(df, rootdir=bundle_root). However, I don't know what to do with the second chunk. I want to append the new data as efficiently as possible, without having to load the entire ctable into memory or replace it entirely.

fredfortier, Jan 18 '18 01:01

What about ctable.append? You could create a ctable from each DataFrame chunk as it comes in and append that ctable to the existing one, rather than trying to convert from a DataFrame and append in one go.

A short example:

import pandas as pd
import numpy as np
import bcolz

df1 = pd.DataFrame({
    "x": np.array([1, 2, 3]),
    "y": np.array([4, 5, 6])
})

ct1 = bcolz.ctable.fromdataframe(df1)

df2 = pd.DataFrame({
    "x": np.array([11, 22, 33]),
    "y": np.array([44, 55, 66])
})

# Convert the second chunk to its own (in-memory) ctable
ct2 = bcolz.ctable.fromdataframe(df2)

# Append the rows of ct2 to ct1 in place
ct1.append(ct2)

print(ct1)

This prints:

[(1, 4) (2, 5) (3, 6) (11, 44) (22, 55) (33, 66)]

This requires either keeping df2 and ct2 in memory at the same time or writing ct2 to a temporary directory on disk until it has been appended. I'm guessing that wouldn't be a problem, or at least much less of one than loading the whole ctable into memory or creating a new ctable every time you get a new chunk.
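For the on-disk case from the question, here is a rough sketch of how the append step could look. It assumes the table was already created once with bcolz.ctable.fromdataframe(df, rootdir=bundle_root), that every chunk has the same columns and dtypes, and that bcolz.open(rootdir, mode="a") reopens the table without reading all of its data. The helper names frame_to_columns and append_chunk are made up for illustration and are not part of bcolz. I haven't tested this against a large table.

```python
import pandas as pd


def frame_to_columns(df, names):
    """Return the DataFrame's columns as NumPy arrays, ordered like `names`."""
    return [df[name].values for name in names]


def append_chunk(rootdir, df):
    """Append one DataFrame chunk to the on-disk ctable at `rootdir`.

    Hypothetical helper: assumes the ctable already exists at `rootdir`
    and that `df` has the same column names and dtypes as the table.
    """
    import bcolz  # imported here so frame_to_columns is usable without bcolz

    ct = bcolz.open(rootdir, mode="a")         # reopen the existing table for appending
    ct.append(frame_to_columns(df, ct.names))  # one array per column, in table order
    ct.flush()                                 # write pending data to disk
    return len(ct)                             # row count after the append
```

Each later chunk would then go through append_chunk(bundle_root, df). Passing the columns as plain NumPy arrays avoids building a second ctable per chunk. As far as I know, fromdataframe stores only the columns and not the index, so if the DatetimeIndex matters you would probably want to call df.reset_index() on every chunk (including the first) so it is kept as a regular column; that part is worth checking.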

Hope this helps!

ckingdev, Jul 22 '18 21:07