Reduce ChunkStore memory footprint
Two changes:
- Reduce the memory footprint when reading data
- Handle duplicate columns in the column filter (a sketch of one possible approach follows the list)
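I can't see the exact implementation from this description alone, so here is a minimal, hypothetical sketch of what duplicate-column handling could look like, assuming it amounts to de-duplicating the requested columns while preserving the caller's ordering; the PR's actual change may differ.

```python
# Hypothetical helper: drop repeated names from a column filter while
# keeping first-seen order. Illustrative only; not the PR's actual code.
def dedupe_columns(columns):
    seen = set()
    deduped = []
    for col in columns:
        if col not in seen:
            seen.add(col)
            deduped.append(col)
    return deduped

print(dedupe_columns(['date', 'd', 'date', 'e']))  # ['date', 'd', 'e']
```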
Benchmarked with a 1GB dataframe, comparing this PR against master (the measured figures are not preserved here). Benchmark script:
```python
import numpy as np
import pandas as pd
from datetime import datetime as dt
from datetime import timedelta as td

days = 2000
secs = 15000

# Build six columns of test data: ids, dates, strings, and floats.
a1 = [range(secs) for _ in range(days)]
a2 = [[dt(2000, 1, 1) + td(days=x)] * secs for x in range(days)]
a3 = [['foo'] * secs for _ in range(days)]
a4 = [np.random.rand(secs) for _ in range(days)]
a5 = [np.random.rand(secs) for _ in range(days)]
a6 = [['HOLIDAY INN WORLD CORP'] * secs for _ in range(days)]

now = dt.now()
result = []
for i in range(days):
    result.append(pd.DataFrame({'security_id': a1[i], 'date': a2[i], 'c': a3[i],
                                'd': a4[i], 'e': a5[i], 'f': a6[i]}, copy=True))
df = pd.concat(result)
print(df.shape)
print((dt.now() - now).total_seconds())

df = df.set_index(['date', 'security_id'])
# In-memory size of the frame, in MB.
print(df.memory_usage(index=True).sum() / 1e6)

from arctic import Arctic
import arctic
print(arctic.__file__)

a = Arctic('localhost')
a.initialize_library('test', lib_type='ChunkStoreV1')  # ChunkStore library type
lib = a['test']
lib.write('test', df)
del df
df = lib.read('test')
```
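A quick sanity check on the round trip is to repeat the size measurement on the frame returned by `lib.read` and compare it against the figure printed before the write; this just reuses the same `memory_usage` accounting from the script above.

```python
# Size of the round-tripped frame, in MB; should match the pre-write figure.
print(df.memory_usage(index=True).sum() / 1e6)
```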
What's the memory saving? Have you measured it? Is it 50%, i.e. only one copy of the data instead of two?
It would be great to have a way to demonstrate the saving, plus an automated test to avoid accidental regressions (a sketch of one approach follows).
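As a sketch only (not what's in this PR): one way to automate this is to sample the process memory while reading and assert that the transient overhead stays within a budget tied to the dataframe's own size. This assumes the third-party `memory_profiler` package; the function name and the 1.5x budget are illustrative, not measured figures.

```python
# Sketch of a memory-regression test using the third-party
# memory_profiler package; the 1.5x budget is illustrative.
from memory_profiler import memory_usage

def assert_read_memory_budget(lib, symbol, frame_mb, budget_ratio=1.5):
    baseline_mb = memory_usage(-1)[0]  # current process memory, in MB
    # Sample memory every 0.1s while lib.read(symbol) runs.
    samples = memory_usage((lib.read, (symbol,)), interval=0.1)
    peak_overhead_mb = max(samples) - baseline_mb
    # With one in-memory copy of the data instead of two, the transient
    # peak should stay well under twice the frame's own footprint.
    assert peak_overhead_mb < budget_ratio * frame_mb, peak_overhead_mb
```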
@TomTaylorLondon are you going to have the bandwidth to finish this or would you like me to resolve it?
Hi @TomTaylorLondon any luck with this?
@shashank88 I spoke with @TomTaylorLondon and am going to take this over from him. I'll get it all fixed up later this week(end).
👍