[ENH] Find contiguous blocks of a value in a dataframe

Open zbarry opened this issue 6 years ago • 4 comments

https://stackoverflow.com/questions/30568701/distinct-contiguous-blocks-in-pandas-dataframe

An example use case: you apply a mask or threshold to a column and want to pick out the contiguous chunks of a time series that are above that threshold.
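For reference, here is a minimal sketch of one common pandas idiom for this use case. The column name `signal`, the threshold of 1.0, and the sample values are made up for illustration, and this is not an existing pyjanitor function.

```python
import pandas as pd

# Hypothetical data; "signal" and the threshold are illustrative only.
df = pd.DataFrame({"signal": [0.2, 1.5, 1.7, 0.1, 0.3, 2.0, 2.2, 2.4, 0.0]})
threshold = 1.0

mask = df["signal"] > threshold

# A new block starts wherever the mask differs from the previous row,
# so the cumulative sum of those change points gives each run its own ID.
block_id = mask.ne(mask.shift()).cumsum()

# Keep only rows above the threshold and collect the index labels per block.
blocks = [group.index.tolist() for _, group in df[mask].groupby(block_id[mask])]
print(blocks)
```

Each inner list holds the index labels of one contiguous run above the threshold.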

zbarry avatar Jun 19 '19 21:06 zbarry

Do you want the indices of the contiguous blocks returned, or a dataframe filtered to only include those rows? The user could also specify a minimum block length (e.g. >=2, >=5 rows).
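Both options could be served by one function, since the filtered dataframe carries the surviving indices with it. A rough sketch follows, assuming a hypothetical helper named `contiguous_rows` and a hypothetical `min_length` parameter; neither exists in pyjanitor.

```python
import pandas as pd


def contiguous_rows(df, mask, min_length=1):
    """Return the rows of df that lie in runs of True at least min_length long."""
    # Give every run (True or False) its own ID, then measure each run's length.
    block_id = mask.ne(mask.shift()).cumsum()
    run_length = mask.groupby(block_id).transform("size")
    return df[mask & (run_length >= min_length)]


# Hypothetical data for illustration.
df = pd.DataFrame({"signal": [5, 0, 6, 7, 0, 8, 9, 9]})
filtered = contiguous_rows(df, df["signal"] > 1, min_length=2)
print(filtered.index.tolist())
```

Here the isolated row at index 0 is dropped because its run is shorter than `min_length`, and `filtered.index` gives the indices if only those are wanted.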

hectormz avatar Jul 16 '19 14:07 hectormz

That's an interesting one. Right now I'm using .groupby().apply() with a custom function that finds the contiguous region indices and calculates statistics from them. Not sure what the best general-purpose formulation would be.

I currently use the scipy.ndimage functions label and find_objects to do this, as in:

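from scipy.ndimage import find_objects, label

def calc_event_stats(df):
    # label() treats every nonzero value as part of an event, so the column
    # is assumed to be thresholded (masked) already.
    labels, num_events = label(df['my_time_series'])

    # One tuple of slices per event; the data is 1-D, so each tuple holds one slice.
    event_indices = find_objects(labels)

    for slices in event_indices:
        event_start_idx = slices[0].start  # first position of the event
        event_stop_idx = slices[0].stop    # one past the last position
        ...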
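To make the behaviour of these two calls concrete, here is a small self-contained run on made-up values. The array contents and the threshold of 1.0 are assumptions for illustration only.

```python
import numpy as np
from scipy.ndimage import find_objects, label

# Hypothetical values; thresholding first yields a boolean mask.
values = np.array([0.2, 1.5, 1.7, 0.1, 0.3, 2.0, 2.2, 2.4, 0.0])
mask = values > 1.0

# Each run of True gets its own integer label; False stays 0.
labels, num_events = label(mask)

# find_objects returns one tuple of slices per label; take the only axis.
spans = [(s[0].start, s[0].stop) for s in find_objects(labels)]
print(num_events, spans)
```

The stop values follow Python slice convention, so they point one past the last element of each run.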

zbarry avatar Oct 12 '19 16:10 zbarry

I see. So in your case, you are calculating statistics for each contiguous block? For your use case, would you be fine specifying:

  • a column
  • a threshold value
  • a threshold rule (greater than, less than or equal to, etc.)
  • a list of functions to apply

and get a dataframe (or other format?) back:

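| Start Index | Stop Index | Length | mean | std  | mode | custom_func |
|-------------|------------|--------|------|------|------|-------------|
| 0           | 6          | 7      | 54.1 | 0.3  | 42   | 41          |
| 10          | 40         | 31     | 98.1 | 0.43 | 94   | 72          |
| 174         | 300        | 127    | 65.2 | 0.7  | 70   | 110         |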

This assumes you want stats for the same column in which you're finding contiguous blocks, but maybe someone would want these stats for another column.

Perhaps this could be one function that uses a helper function to find the contiguous blocks (which users could also call directly for some other purpose).
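A rough sketch of how such an interface might look is below. The names `find_blocks`, `summarize_blocks`, and `RULES`, along with every parameter, are hypothetical and not part of pyjanitor. Start and stop are row positions, and the stop is inclusive to match the table above.

```python
import operator

import numpy as np
import pandas as pd

# Hypothetical mapping from a rule string to a comparison function.
RULES = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def find_blocks(mask):
    """Return (start, stop) row positions of each run of True; stop is inclusive."""
    padded = np.concatenate(([False], np.asarray(mask, dtype=bool), [False]))
    # Edges alternate between a run's start and the position one past its end.
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(a), int(b) - 1) for a, b in zip(edges[::2], edges[1::2])]


def summarize_blocks(df, column, threshold, rule=">", funcs=("mean", "std")):
    """Build one summary row per contiguous block of `column` meeting the rule."""
    mask = RULES[rule](df[column], threshold)
    rows = []
    for start, stop in find_blocks(mask):
        block = df[column].iloc[start : stop + 1]
        row = {"Start Index": start, "Stop Index": stop, "Length": stop - start + 1}
        row.update(block.agg(list(funcs)).to_dict())
        rows.append(row)
    return pd.DataFrame(rows)


# Hypothetical data for illustration.
df = pd.DataFrame({"signal": [0.2, 1.5, 1.7, 0.1, 0.3, 2.0, 2.2, 2.4, 0.0]})
summary = summarize_blocks(df, "signal", threshold=1.0, rule=">", funcs=["mean", "std"])
print(summary)
```

Keeping `find_blocks` separate means it can be reused on any boolean mask without going through the summary step.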

hectormz avatar Oct 12 '19 18:10 hectormz

Yeah, that's pretty much it, though I do want stats on the other columns as well. Separating block finding from the stats calculation does seem like probably the best way to keep things modular.
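One way that split might look, sketched under assumptions: `label_blocks` and `block_stats` are hypothetical names, the `temperature` column and all values are invented, and scipy's `label` is used only because it was already mentioned in this thread.

```python
import pandas as pd
from scipy.ndimage import label


def label_blocks(mask):
    """Number each run of True in mask 1..k; rows outside any run get 0."""
    labels, _ = label(mask.to_numpy())
    return pd.Series(labels, index=mask.index, name="block")


def block_stats(df, blocks, columns, funcs):
    """Aggregate the chosen columns per block, skipping rows labelled 0."""
    inside = blocks > 0
    return df.loc[inside, columns].groupby(blocks[inside]).agg(funcs)


# Hypothetical data: blocks are found on "signal", stats cover both columns.
df = pd.DataFrame(
    {
        "signal": [0.2, 1.5, 1.7, 0.1, 0.3, 2.0, 2.2, 2.4, 0.0],
        "temperature": [10, 11, 13, 12, 12, 20, 22, 24, 15],
    }
)
blocks = label_blocks(df["signal"] > 1.0)
stats = block_stats(df, blocks, ["signal", "temperature"], ["mean", "max"])
print(stats)
```

Because the block labels are just a Series aligned with the dataframe, any set of columns can be aggregated against them.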

zbarry avatar Oct 17 '19 15:10 zbarry