pystore
Is `append` loading the entire dataset into memory just to append new data?
Based on this code, each append loads all of the existing data into memory to check for duplicates, then rewrites the entire parquet dataset. Doing that for items with 100k existing records, across multiple threads, consumes 100% of memory for each single-record append.
Why not use fastparquet's `write` method to append the data? Its `append` parameter accepts True / False / "overwrite": https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write
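For reference, a minimal sketch of that approach with plain fastparquet (the file name and data are illustrative, not pystore's API):

```python
import fastparquet
import pandas as pd

# initial write of the stored item (illustrative file name)
existing = pd.DataFrame({"price": [1.0, 2.0]},
                        index=pd.date_range("2021-01-01", periods=2))
fastparquet.write("item.parquet", existing)

# append=True adds new row group(s) to the existing file without
# reading the already-stored data back into memory
new_rows = pd.DataFrame({"price": [3.0]},
                        index=pd.date_range("2021-01-03", periods=1))
fastparquet.write("item.parquet", new_rows, append=True)
```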
```python
try:
    if epochdate or ("datetime" in str(data.index.dtype) and
                     any(data.index.nanosecond) > 0):
        data = utils.datetime_to_int64(data)
    # reads the full index of the stored item into memory
    old_index = dd.read_parquet(self._item_path(item, as_string=True),
                                columns=[], engine=self.engine
                                ).index.compute()
    data = data[~data.index.isin(old_index)]
except Exception:
    return

if data.empty:
    return

if data.index.name == "":
    data.index.name = "index"

# combine old dataframe with new
current = self.item(item)  # reopens all existing data as a dask dataframe
new = dd.from_pandas(data, npartitions=1)

# old + new are concatenated and deduplicated, and the whole combined
# frame is then rewritten to parquet
combined = dd.concat([current.data, new]).drop_duplicates(keep="last")
```
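Keeping the existing duplicate check, a hedged sketch of an incremental append could look like this (the function name, path handling, and hive file scheme are assumptions, not pystore's implementation):

```python
import dask.dataframe as dd
import fastparquet
import pandas as pd

def append_incremental(path: str, data: pd.DataFrame) -> None:
    # read only the index of the stored item (same trick as above),
    # then drop rows that are already present
    old_index = dd.read_parquet(path, columns=[]).index.compute()
    new_rows = data[~data.index.isin(old_index)]
    if new_rows.empty:
        return
    # append only the new row group(s); existing ones are left untouched
    # (assumes the data-set was written with file_scheme="hive")
    fastparquet.write(path, new_rows, append=True, file_scheme="hive")
```

Only the new row groups are written; the stored data is never read back in full, so memory use would scale with the size of the appended chunk rather than the item.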
Hi @ovresko, for your information, I have started an alternative lib, oups, that has some similarities with pystore. Please beware, this is my first project, but I would gladly accept any feedback on it.
@ranaroussi, I am aware this post may not be welcome, and I am sorry if it is a bit rude. Please remove it if it is.