missingno

Improve handling of the freq argument in matrix()

Open wangsen992 opened this issue 5 years ago • 7 comments

This is probably a very simple bug to fix. If nobody tackles it, I will probably do it when I have some free time and submit a pull request. The problem is with the freq argument of msno.matrix(). When it is set, the code in missingno.py (shown below) builds a date_range that starts at the beginning of the day. If df.index.get_loc(value) cannot find one of those values in the index, the resulting KeyError is caught and re-raised, and the operation halts.

The issue is that time series data often does not begin or end on a full-day boundary, i.e. 00:00. So simply clipping the generated range to the range of the input df might solve the problem.

if freq:
        ts_list = []

        if type(df.index) == pd.PeriodIndex:
            ts_array = pd.date_range(df.index.to_timestamp().date[0],
                                     df.index.to_timestamp().date[-1],
                                     freq=freq).values

            ts_ticks = pd.date_range(df.index.to_timestamp().date[0],
                                     df.index.to_timestamp().date[-1],
                                     freq=freq).map(lambda t:
                                                    t.strftime('%Y-%m-%d'))

        elif type(df.index) == pd.DatetimeIndex:
            ts_array = pd.date_range(df.index.date[0], df.index.date[-1],
                                     freq=freq).values

            ts_ticks = pd.date_range(df.index.date[0], df.index.date[-1],
                                     freq=freq).map(lambda t:
                                                    t.strftime('%Y-%m-%d'))
        else:
            raise KeyError('Dataframe index must be PeriodIndex or DatetimeIndex.')
        try:
            for value in ts_array:
                ts_list.append(df.index.get_loc(value))
        except KeyError:
            raise KeyError('Could not divide time index into desired frequency.')
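A minimal sketch of how the failure can be reproduced outside missingno, assuming a hypothetical 15-minute series that starts at 06:00 instead of midnight. The data and the "D" frequency are made up for illustration; only the date_range construction mirrors the quoted code.

```python
import pandas as pd

# Hypothetical 15-minute data that starts and ends mid-day.
index = pd.date_range("2020-05-01 06:00", "2020-05-03 18:00", freq="15min")
df = pd.DataFrame({"a": range(len(index))}, index=index)

# Same construction as the quoted code: the endpoints are truncated to
# dates, so the generated range starts at 2020-05-01 00:00.
ts_array = pd.date_range(df.index.date[0], df.index.date[-1], freq="D")

# Ticks that have no matching row in the data.
missing = [ts for ts in ts_array if ts not in df.index]
print(missing)

# get_loc raises KeyError for the first tick, which is what aborts the plot.
try:
    df.index.get_loc(ts_array[0])
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

Here only the first tick is absent, because the data begins six hours after midnight; the later midnights are present in the index.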

PS: Hopefully the format of this issue is clear. This is my first time raising an issue, so any suggestions on improving it are welcome.

And great work with this project!

wangsen992 avatar May 25 '20 23:05 wangsen992

Actually, I think simply moving the try-except clause inside the for loop might just work.

for value in ts_array:
    try:
        ts_list.append(df.index.get_loc(value))
    except KeyError:
        logging.warning('Could not divide time index into desired frequency.')

Something like that without breaking the for-loop.
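As a self-contained sketch of that idea, the loop could be wrapped in a helper that skips ticks missing from the index. The name locate_ticks and the sample data are hypothetical and are not part of missingno.

```python
import logging

import pandas as pd


def locate_ticks(index, ticks):
    """Return the positions of the ticks found in index, skipping the rest."""
    positions = []
    for value in ticks:
        try:
            positions.append(index.get_loc(value))
        except KeyError:
            # Warn and continue instead of aborting the whole loop.
            logging.warning("Tick %s is not in the index; skipping it.", value)
    return positions


# Hypothetical 15-minute data that starts at 06:00 rather than midnight.
index = pd.date_range("2020-05-01 06:00", "2020-05-03 18:00", freq="15min")
ticks = pd.date_range(index[0].normalize(), index[-1].normalize(), freq="D")

positions = locate_ticks(index, ticks)
print(positions)
```

One caveat with this approach: in the quoted code the tick labels (ts_ticks) are built from the same range, so they would presumably need to be filtered in step with the positions to keep both lists the same length.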

wangsen992 avatar May 25 '20 23:05 wangsen992

If you go ahead and submit a PR I'm happy to take a look at that. :)

ResidentMario avatar May 26 '20 00:05 ResidentMario

What is the status of this issue? I just installed the package, and it looks like this bug is still NOT fixed.

xxl4tomxu98 avatar Aug 02 '21 16:08 xxl4tomxu98

This bug probably still exists. I didn't look at freq the last time I did an OSS maintenance day; I'll try to look at it the next time I have time.

ResidentMario avatar Aug 02 '21 16:08 ResidentMario

If someone has this problem and cannot cut off their time series (gaps between days), another solution could be to reindex the time series with a complete range of dates (hh:mm:ss as necessary) and fill the gaps with NaN.
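A minimal sketch of that workaround, assuming hypothetical 15-minute data that starts and ends mid-day. The column name and the dates are invented for illustration. The index is padded out to whole days with DataFrame.reindex, which leaves NaN in the rows that had no data.

```python
import numpy as np
import pandas as pd

# Hypothetical 15-minute data that starts at 06:00 and ends at 18:00.
index = pd.date_range("2020-05-01 06:00", "2020-05-03 18:00", freq="15min")
df = pd.DataFrame({"a": np.arange(len(index), dtype=float)}, index=index)

# Complete range: midnight of the first day through midnight after the last.
full_index = pd.date_range(
    df.index[0].normalize(),
    df.index[-1].normalize() + pd.Timedelta(days=1),
    freq="15min",
)

# Rows that did not exist before are filled with NaN.
padded = df.reindex(full_index)

# Every daily tick built the way the quoted code builds it now has a row.
daily_ticks = pd.date_range(padded.index.date[0], padded.index.date[-1], freq="D")
all_found = bool(daily_ticks.isin(padded.index).all())
print(len(df), len(padded), all_found)
```

The padded rows will show up as missing values in the matrix plot, so this changes what the plot displays at the edges of the series.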

heyej avatar Oct 14 '21 01:10 heyej

Try removing the .values from the code.

maubere-tls avatar Apr 19 '22 02:04 maubere-tls

Hi,

The index of my dataframe holds values in "yyyy-mm-dd hh:mi:ss" format, and the rows are at 15-minute intervals. Can you tell me how to use the frequency parameter on the matrix plot?

HemalathaRamanujam2022 avatar Mar 22 '23 09:03 HemalathaRamanujam2022