pyjanitor [ENH] Find contiguous blocks of a value in a dataframe

https://stackoverflow.com/questions/30568701/distinct-contiguous-blocks-in-pandas-dataframe

An example use case: you apply a mask or threshold to a column and want to pick out the contiguous chunks of the time series that are above that threshold.
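To make the use case concrete, here is a minimal sketch using the common pandas shift/cumsum idiom. The column name, data, and threshold are made up for illustration, and this is only one possible way to do it, not a proposed pyjanitor API.

```python
import pandas as pd

# Hypothetical data: a short time series and an arbitrary threshold.
df = pd.DataFrame({"signal": [0.1, 0.9, 0.8, 0.2, 0.7, 0.95, 0.85, 0.1]})
mask = df["signal"] > 0.5

# A new run starts wherever the mask differs from the previous row;
# the cumulative sum of those change points gives every run its own ID.
block_id = (mask != mask.shift()).cumsum()

# Keep only the rows above the threshold, split into one chunk per run.
blocks = [chunk for _, chunk in df[mask].groupby(block_id[mask])]

for chunk in blocks:
    print(chunk.index.tolist())
```

Each element of `blocks` is a dataframe holding one contiguous above-threshold chunk, with the original index preserved.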

Jun 19 '19 21:06 zbarry

Do you want the indices of the contiguous blocks returned, or a dataframe filtered to only include those rows? The user could also specify a minimum block length (e.g. >=2, >=5 rows).

Jul 16 '19 14:07 hectormz

> Do you want the indices of the contiguous blocks returned, or a dataframe filtered to only include those rows? The user could also specify a minimum block length (e.g. >=2, >=5 rows).

That's an interesting one. Right now I'm using .groupby().apply() with a custom function that finds the indices of the contiguous regions and calculates statistics from them. I'm not sure what the best general-purpose formulation would be.

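I currently use the scipy.ndimage functions label and find_objects to do this, as in:

from scipy.ndimage import find_objects, label

def calc_event_stats(df):
    # Label each run of consecutive non-zero values with its own integer ID.
    labels, num_events = label(df['my_time_series'])

    # One tuple of slices per labelled run (a single slice for 1-D input).
    event_indices = find_objects(labels)

    for slices in event_indices:
        event_start_idx = slices[0].start
        event_stop_idx = slices[0].stop  # exclusive end of the run
        ...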
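For reference, here is a small runnable sketch of what label and find_objects return for a 1-D mask. The mask values are made up for illustration; the point is only to show the shape of the results.

```python
import numpy as np
from scipy.ndimage import find_objects, label

# Hypothetical boolean mask: True where the series exceeds some threshold.
mask = np.array([False, True, True, False, True, True, True, False])

# labels assigns 0 to background and 1, 2, ... to each run of True values.
labels, num_events = label(mask)

# find_objects returns one tuple of slices per run; for 1-D input each
# tuple holds a single slice whose stop is exclusive.
spans = [(s[0].start, s[0].stop) for s in find_objects(labels)]

print(labels.tolist(), num_events, spans)
```

Because the stop is exclusive, each span can be passed straight to `.iloc[start:stop]` to pull the matching rows out of a dataframe.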

Oct 12 '19 16:10 zbarry

I see. So in your case, you are calculating statistics for each contiguous block? For your use case, would you be fine specifying:

- a column
- a threshold value
- a threshold rule (greater than, less than or equal to, etc.)
- a list of functions to apply

and getting a dataframe (or some other format?) back:

Start Index	Stop Index	Length	mean	std	mode	custom_func
0	6	7	54.1	0.3	42	41
10	40	31	98.1	0.43	94	72
174	300	127	65.2	0.7	70	110

This assumes you want stats for the same column in which you're finding contiguous blocks, but maybe someone would want these stats for another column.

Perhaps this could be one function that uses a helper function to find the contiguous blocks (which users could also use for some other purpose).
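A rough sketch of that split might look like the following. The names find_contiguous_blocks and block_stats, their parameters, and the output column names are all hypothetical, not an existing pyjanitor API; the block detection here uses numpy.diff on a zero-padded mask, which is just one of several possible approaches.

```python
import operator

import numpy as np
import pandas as pd


def find_contiguous_blocks(mask, min_length=1):
    """Hypothetical helper: one row per run of True values in a boolean mask."""
    values = np.asarray(mask, dtype=bool).astype(int)
    # Pad with zeros so runs touching either end still produce an edge.
    edges = np.diff(np.concatenate(([0], values, [0])))
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1) - 1  # inclusive, as in the table above
    blocks = pd.DataFrame({"start": starts, "stop": stops})
    blocks["length"] = blocks["stop"] - blocks["start"] + 1
    return blocks[blocks["length"] >= min_length].reset_index(drop=True)


def block_stats(df, column, threshold, rule=operator.gt, funcs=("mean", "std")):
    """Hypothetical wrapper: threshold a column, then summarise each block."""
    blocks = find_contiguous_blocks(rule(df[column], threshold))
    stats = [
        df[column].iloc[b.start : b.stop + 1].agg(list(funcs))
        for b in blocks.itertuples()
    ]
    return pd.concat([blocks, pd.DataFrame(stats).reset_index(drop=True)], axis=1)
```

Usage would be something like `block_stats(df, "my_time_series", 4, rule=operator.ge, funcs=["mean", "max"])`. Start and stop are positional (not label-based) indices in this sketch.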

Oct 12 '19 18:10 hectormz

Yeah, that's pretty much it, though I also want stats on the other columns. Separating finding the blocks from calculating the stats does seem like the best way to go about it, to keep things modular.
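As a sketch of why the separation helps: once each row carries a block ID, stats on any other column fall out of an ordinary groupby. The column names and data below are invented for illustration, and the shift/cumsum labelling is just one way to get the IDs.

```python
import pandas as pd

# Hypothetical data: blocks are defined by "signal", stats wanted on both columns.
df = pd.DataFrame({
    "signal": [0.1, 0.9, 0.8, 0.2, 0.7, 0.95, 0.85, 0.1],
    "temperature": [20, 21, 23, 22, 25, 27, 26, 24],
})

# Step 1: find the blocks (independent of any statistics).
mask = df["signal"] > 0.5
block_id = (mask != mask.shift()).cumsum()[mask]

# Step 2: aggregate whichever columns are of interest, per block.
stats = df[mask].groupby(block_id).agg(
    signal_mean=("signal", "mean"),
    temperature_mean=("temperature", "mean"),
    temperature_max=("temperature", "max"),
)
print(stats)
```

The block-finding step never needs to know which columns or functions the stats step will use.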

Oct 17 '19 15:10 zbarry