[ENH] Find contiguous blocks of a value in a dataframe
https://stackoverflow.com/questions/30568701/distinct-contiguous-blocks-in-pandas-dataframe
An example use case: you apply a mask or threshold to a column, and you would like to pick out the contiguous chunks of a time series that are above that threshold.
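
For illustration, here is a minimal pandas-only sketch of that use case. The function name `label_blocks` and the sample data are hypothetical and not part of any existing API; the idea is just to mark where each above-threshold run starts and take a cumulative sum.

```python
import pandas as pd

def label_blocks(mask: pd.Series) -> pd.Series:
    """Hypothetical helper: number each contiguous run of True values 1, 2, ...

    Rows where the mask is False get the label 0.
    """
    mask = mask.astype(bool)
    # A run starts wherever the mask is True and the previous row was not.
    starts = mask & ~mask.shift(fill_value=False)
    # Counting the starts seen so far gives every row of a run the same id.
    return starts.cumsum().where(mask, 0)

s = pd.Series([1, 5, 6, 2, 7, 8, 9, 1])  # made-up time series
block_id = label_blocks(s > 4)           # threshold of 4 is arbitrary
print(block_id.tolist())
```

Grouping the original series by `block_id` (after dropping the 0 label) would then give one group per contiguous chunk.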
Do you want the indices of the contiguous blocks returned, or a dataframe filtered to only include those rows? The user could also specify the minimum length of a contiguous block (e.g. >=2, >=5, etc.).
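
A rough sketch of what the two return options might look like, assuming the minimum refers to the number of consecutive rows in a block. `filter_min_run` and its `min_length` parameter are hypothetical names used only for this example.

```python
import pandas as pd

def filter_min_run(df: pd.DataFrame, mask: pd.Series, min_length: int = 2) -> pd.DataFrame:
    """Hypothetical helper: keep rows that sit in a True run of at least min_length rows."""
    mask = mask.astype(bool)
    # A new run id begins every time the mask value changes.
    run_id = (mask != mask.shift()).cumsum()
    # Length of the run each row belongs to.
    run_len = run_id.map(run_id.value_counts())
    return df[mask & (run_len >= min_length)]

df = pd.DataFrame({"value": [1, 5, 6, 2, 7, 8, 9, 1, 6]})  # made-up data
filtered = filter_min_run(df, df["value"] > 4, min_length=3)  # option 2: filtered dataframe
kept_index = filtered.index.tolist()                          # option 1: just the indices
print(kept_index)
```

Returning the filtered dataframe covers both cases here, since the indices are available from its `.index`.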
That's an interesting one. Right now I'm using .groupby().apply() with a custom function that finds the indices of each contiguous region and calculates statistics from them. I'm not sure what the best general-purpose formulation would be.
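I currently use the scipy.ndimage functions label and find_objects to do this, as in:

    from scipy.ndimage import label, find_objects

    def calc_event_stats(df):
        # label() gives each contiguous run of nonzero values its own integer
        labels, num_events = label(df['my_time_series'])
        # find_objects() returns one tuple of slices per labelled run
        event_indices = find_objects(labels)
        for slices in event_indices:
            event_start_idx = slices[0].start
            event_stop_idx = slices[0].stop  # exclusive, as with any slice
            ...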
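
To make the return values concrete, here is a small standalone run of those two scipy calls on a made-up 0/1 array (the array and the `spans` name are only for this illustration):

```python
import numpy as np
from scipy.ndimage import label, find_objects

# Made-up mask: 1 where the series is above the threshold, 0 elsewhere.
above = np.array([0, 1, 1, 0, 1, 1, 1, 0])

labels, num_events = label(above)   # labels: 0 for background, 1..num_events per run
slices = find_objects(labels)       # one tuple of slices per labelled run

# For 1-D input each tuple holds a single slice; stop is exclusive.
spans = [(s[0].start, s[0].stop) for s in slices]
print(labels.tolist(), num_events, spans)
```

Because the stop is exclusive, the length of each event is simply `stop - start`.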
I see. So in your case, you are calculating statistics for each contiguous block? For your use case, would you be fine specifying:
- a column
- threshold value
- threshold rule (greater than, less than or equal to, etc.)
- list of functions to apply
and get a dataframe (or other format?) back:
| Start Index | Stop Index | Length | mean | std | mode | custom_func |
|---|---|---|---|---|---|---|
| 0 | 6 | 7 | 54.1 | 0.3 | 42 | 41 |
| 10 | 40 | 31 | 98.1 | 0.43 | 94 | 72 |
| 174 | 300 | 127 | 65.2 | 0.7 | 70 | 110 |
This assumes you want stats for the same column in which you're finding contiguous blocks, but someone might want these stats for another column.
Perhaps this could be one function that uses a helper function to find the contiguous blocks (which users could also use for other purposes).
Yeah, that's pretty much it, though I do want stats on the other columns as well. Separating finding the blocks from calculating the stats seems like the best way to keep things modular.
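
One possible shape for that split, as a sketch only. Every name below (`find_blocks`, `block_stats`, `rule`, `stat_columns`, `funcs`) is hypothetical and none of this is an existing API. Start and stop are positional and inclusive, matching the table above.

```python
import operator
import pandas as pd

def find_blocks(mask):
    """Hypothetical helper: (start, stop) positions, inclusive, of each True run."""
    blocks, start = [], None
    for i, flag in enumerate(pd.Series(mask).astype(bool).to_numpy()):
        if flag and start is None:
            start = i                       # a run begins
        elif not flag and start is not None:
            blocks.append((start, i - 1))   # the run ended on the previous row
            start = None
    if start is not None:                   # run reaches the end of the data
        blocks.append((start, len(mask) - 1))
    return blocks

def block_stats(df, column, threshold, rule=operator.gt,
                stat_columns=None, funcs=("mean", "std")):
    """Hypothetical wrapper: one row of statistics per contiguous block."""
    mask = rule(df[column], threshold)      # e.g. operator.gt -> df[column] > threshold
    stat_columns = stat_columns or [column]
    rows = []
    for start, stop in find_blocks(mask):
        chunk = df.iloc[start:stop + 1]
        row = {"start": start, "stop": stop, "length": stop - start + 1}
        for col in stat_columns:
            # Series.agg with a list returns one value per function name.
            for name, value in chunk[col].agg(list(funcs)).items():
                row[f"{col}_{name}"] = value
        rows.append(row)
    return pd.DataFrame(rows)

# Made-up data: find blocks on "signal", report stats on both columns.
df = pd.DataFrame({"signal": [1, 5, 7, 2, 8, 8, 9, 1],
                   "other": [10, 20, 30, 40, 50, 60, 70, 80]})
stats = block_stats(df, "signal", 4, stat_columns=["signal", "other"],
                    funcs=["mean", "max"])
print(stats)
```

Keeping `find_blocks` separate means it can be reused on any boolean mask, while `block_stats` stays a thin layer that only handles the thresholding and aggregation.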