bcolz
Merge multiple DataFrames into a ctable
I'm trying to do something that seems intuitively simple, but I'm unable to accomplish it.
I'm trying to populate a ctable from a stream of sequential data. The data always arrives as a DataFrame with the same structure, and each DataFrame has a DatetimeIndex.
Initializing a ctable from the first data chunk is simple: bcolz.ctable.fromdataframe(df, rootdir=bundle_root). However, I don't know what to do with the second chunk. I want to append the new data as efficiently as possible, without having to load the entire ctable into memory or replace it entirely.
What about ctable.append? You could create a ctable for each chunk from the dataframes as they come in and append ctable to ctable rather than trying to append and convert from a dataframe in one go.
A short example:
import pandas as pd
import numpy as np
import bcolz

# First chunk: build the initial ctable from a DataFrame
df1 = pd.DataFrame({
    "x": np.array([1, 2, 3]),
    "y": np.array([4, 5, 6])
})
ct1 = bcolz.ctable.fromdataframe(df1)

# Second chunk: convert it to its own ctable, then append it to the first
df2 = pd.DataFrame({
    "x": np.array([11, 22, 33]),
    "y": np.array([44, 55, 66])
})
ct2 = bcolz.ctable.fromdataframe(df2)
ct1.append(ct2)

print(ct1)
Resulting in
[(1, 4) (2, 5) (3, 6) (11, 44) (22, 55) (33, 66)]
This requires either keeping df2 and ct2 in memory at the same time or writing ct2 to disk in a temporary directory until it is appended. I'm guessing that wouldn't be an issue, or at least not as much of one as loading the whole ctable into memory or creating a new ctable every time you get a new chunk.
Hope this helps!