
Is append loading the entire data into memory just to append new data?

ovresko opened this issue 3 years ago · 1 comment

Based on the code below, on each append we load all of the existing data into memory to check for duplicates, then rewrite the entire Parquet dataset. For items with 100k existing records, with multiple threads running, this consumes 100% of memory for each single-record append.

Why not use fastparquet's write method to append the data instead? Its append parameter accepts True / False / "overwrite": https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write (a sketch of that approach follows the quoted snippet below).

    try:
        if epochdate or ("datetime" in str(data.index.dtype) and
                         any(data.index.nanosecond) > 0):
            data = utils.datetime_to_int64(data)
        old_index = dd.read_parquet(self._item_path(item, as_string=True),
                                    columns=[], engine=self.engine
                                    ).index.compute()
        data = data[~data.index.isin(old_index)]
    except Exception:
        return

    if data.empty:
        return

    if data.index.name == "":
        data.index.name = "index"

    # combine old dataframe with new
    current = self.item(item)
    new = dd.from_pandas(data, npartitions=1)
    combined = dd.concat([current.data, new]).drop_duplicates(keep="last")
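
Here is a minimal sketch of the suggested alternative, assuming a dataset written with fastparquet's "hive" file scheme. The function name append_new_rows and the path argument are hypothetical; it keeps the index-only duplicate check from the snippet above but replaces the full rewrite with an incremental append:

    import pandas as pd
    import dask.dataframe as dd
    import fastparquet

    def append_new_rows(path: str, data: pd.DataFrame) -> None:
        # Cheap duplicate check: read only the index, never the data columns.
        old_index = dd.read_parquet(path, columns=[],
                                    engine="fastparquet").index.compute()
        data = data[~data.index.isin(old_index)]
        if data.empty:
            return
        # append=True adds the new rows as an extra row-group instead of
        # rewriting the whole dataset; the existing schema must match `data`,
        # and the dataset must use file_scheme="hive" (a directory of parts).
        fastparquet.write(path, data, append=True, file_scheme="hive")

One trade-off worth noting: frequent small appends accumulate many small row-groups, so reads may slow down over time and an occasional compaction (a single full rewrite) could still be needed.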

ovresko — Aug 27 '21 17:08

Hi @ovresko, for information, I have started an alternative library, oups, which has some similarities with pystore. Please be aware this is my first project, but I would gladly accept any feedback on it.

@ranaroussi, I am aware this post may not be welcome, and I am sorry if it is a bit rude. Please remove it if it is.

yohplala — Jan 18 '22 09:01