
get index of the last element in Chunkstore

Open KonstantinLitvin opened this issue 4 years ago • 5 comments

Is there any way to get the index of the last element in a ChunkStore other than reading the whole last chunk? I'd like to append new data to the ChunkStore without duplicating elements that are already stored. Basically, I have a new data_frame that overlaps with the data_frame in the ChunkStore, and I need to cut it so that it starts right after the end of the stored one. I've tried writing the last date_time_index to the metadata, but maybe there is a more elegant way to do that? I also thought about using update(...) instead of append(...), but I don't know whether it's a good idea to rewrite a whole '1Y' chunk when I only need to add around one week of data.
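For illustration, the cutting step described above can be sketched in plain pandas. This is only a sketch: `trim_overlap` is a hypothetical helper, not part of arctic, and it assumes the last stored index is already known (for example from metadata) and that the index is sorted and comparable with it.

```python
import pandas as pd


def trim_overlap(new_df, last_index):
    """Keep only the rows whose index is strictly after last_index.

    Hypothetical helper for illustration; last_index is assumed to be
    known already (e.g. read from metadata) or None for an empty symbol.
    """
    if last_index is None:
        return new_df
    return new_df[new_df.index > last_index]


# Stored data is assumed to end on 2019-12-06; the new frame overlaps it.
stored_last = pd.Timestamp("2019-12-06")
new_df = pd.DataFrame(
    {"close": [1.0, 2.0, 3.0, 4.0]},
    index=pd.date_range("2019-12-05", periods=4, freq="D"),
)

fresh = trim_overlap(new_df, stored_last)  # rows for 2019-12-07 and 2019-12-08
```

A strict `>` comparison also covers the case where the last stored timestamp does not appear exactly in the new frame, which a lookup by label would miss.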

KonstantinLitvin avatar Dec 14 '19 10:12 KonstantinLitvin

Have you found a clever way of doing this? I've been thinking about this same problem since most of the time we are not refreshing the entire series of data, and mainly updating from last update until today. And maybe some symbols don't need updating altogether.

I would think that you need to save the last updated date in the metadata.

luongjames8 avatar Dec 10 '20 14:12 luongjames8

a chunksize of 1 year seems like a bad idea unless you frequently are reading/writing data of that size

bmoscon avatar Dec 11 '20 00:12 bmoscon

I think @KonstantinLitvin's issue is similar to the one in #610

luongjames8 avatar Dec 11 '20 05:12 luongjames8

> a chunksize of 1 year seems like a bad idea unless you frequently are reading/writing data of that size

Yes, I usually read 1-3 years of daily (or weekly) data, and reading with a one-year chunk size is quite fast compared with a 10 year / 1 month chunk size.
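To make the trade-off concrete, here is a rough sketch that counts how many calendar periods a read range spans for a given chunk size. It does not call arctic; `chunks_touched` is a hypothetical helper, and it assumes chunks line up with calendar years or months.

```python
import pandas as pd


def chunks_touched(start, end, chunk_size):
    """Count the calendar periods of chunk_size that overlap [start, end].

    Hypothetical helper; assumes chunks align with calendar periods.
    """
    return len(pd.period_range(start=start, end=end, freq=chunk_size))


# A three-year read that starts mid-year.
three_years = ("2017-06-01", "2020-05-31")
yearly = chunks_touched(*three_years, "Y")   # calendar years 2017..2020
monthly = chunks_touched(*three_years, "M")  # one period per month

# A one-week append falls inside a single yearly period.
weekly_append = chunks_touched("2019-12-09", "2019-12-15", "Y")
```

Under these assumptions a multi-year read touches a handful of yearly chunks but dozens of monthly ones, while a one-week append still lands in a single yearly chunk that covers far more data than is being added.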

KonstantinLitvin avatar Dec 16 '20 08:12 KonstantinLitvin

> Have you found a clever way of doing this? I've been thinking about this same problem since most of the time we are not refreshing the entire series of data, and mainly updating from last update until today. And maybe some symbols don't need updating altogether.
>
> I would think that you need to save the last updated date in the metadata.

Yes, I use metadata for this purpose:

    def append(self, symbol, data_frame, metadata=None):
        # Merge caller-supplied metadata with what is already stored for the symbol.
        metadata = {} if metadata is None else metadata
        metadata.update(self.read_metadata(symbol))

        # Prefer the cached last index; fall back to reading it from the store.
        last_index = metadata.get('last_index')
        if last_index is None:
            last_index = self.get_last_index(symbol)

        # If the new frame contains the last stored index, keep only the rows after it.
        if last_index in data_frame.index:
            data_frame = data_frame.loc[last_index:].iloc[1:]

        if data_frame.empty:
            logger.info("no new data")
        else:
            len_before_update = self.get_length(symbol)
            len_chunk = len(data_frame)
            self.library.append(symbol, data_frame)

            # Sanity check: the symbol grew by exactly the number of appended rows.
            len_after_update = self.get_length(symbol)
            assert len_before_update + len_chunk == len_after_update
            logger.info(f"{len_chunk} rows were updated")

            # Cache the new last index so the next append can skip the store read.
            metadata.update({'last_index': self._get_last_index(data_frame)})
            self.write_metadata(symbol, metadata)

        if self.duplicates_test(symbol):
            logger.warning(f'found duplicates; library: {self.library_name}, symbol: {symbol}')
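
The `duplicates_test` helper is not shown in this thread. As a rough idea of what such a check could look like on a frame that has already been read back, here is a pandas-only sketch; `has_duplicate_index` and `duplicated_labels` are hypothetical names, and how the frame is read from the library is left out.

```python
import pandas as pd


def has_duplicate_index(df):
    """Return True if any index label occurs more than once."""
    return bool(df.index.duplicated().any())


def duplicated_labels(df):
    """Return the index labels that occur more than once, each listed once."""
    return list(df.index[df.index.duplicated(keep=False)].unique())


clean = pd.DataFrame(
    {"close": [1.0, 2.0, 3.0]},
    index=pd.date_range("2020-12-14", periods=3, freq="D"),
)
# Simulate appending a row that was already stored.
with_dup = pd.concat([clean, clean.iloc[-1:]])

clean_has_dups = has_duplicate_index(clean)
dup_has_dups = has_duplicate_index(with_dup)
dup_labels = duplicated_labels(with_dup)
```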

KonstantinLitvin avatar Dec 16 '20 08:12 KonstantinLitvin