
feat(api): pythonic downsampling + limits

Open jayceslesar opened this issue 3 years ago • 5 comments

Being able to downsample in a pythonic way would be pretty neat. I sometimes write queries on time series data that is billions of rows. Currently, to downsample, I take a mod over a row-number index I create, but it would be nice to be able to slice with the `[::step_size]` syntax used for iterables.

Another interesting idea is a limit that works sort of like `np.linspace`: specifying a limit of 1000 would give you back 1000 rows spaced evenly across some index (which must be sorted).
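A minimal sketch of what that "limit" semantics could mean, in pure Python (the function name and signature are illustrative, not an ibis API):

```python
def evenly_spaced_indices(n: int, limit: int) -> list[int]:
    """Return `limit` row positions spread evenly over range(n), like np.linspace."""
    if limit >= n:
        # nothing to downsample: keep every row
        return list(range(n))
    if limit == 1:
        return [0]
    # spacing between kept rows; > 1 because limit < n here
    step = (n - 1) / (limit - 1)
    return [round(i * step) for i in range(limit)]
```

For example, asking for 4 rows out of 10 keeps positions 0, 3, 6, and 9, so the first and last rows are always included.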

jayceslesar avatar Jan 23 '22 23:01 jayceslesar

@jayceslesar Thanks for making an issue about this.

Indeed, many of the past and current maintainers have worked or currently work extensively with time series and it's an area of the library that is ripe for improvement and expansion!

As you've seen, we have asof_join and basic timestamp/date/time operations, but tools like resampling don't exist yet.

My hunch is that resampling can be implemented as an expression rewrite:

  1. construct a sequence of evenly spaced "index" values, such as dates from a to b as an intermediate table expression (call that index_t and the dates column dates).
  2. left join index_t and the original table (t) on index_t.dates = t.dates
  3. Group by dates and aggregate as desired

The main challenge I see here is the ability of a database to generate a sequence of timestamps/dates/times with a given step, though I think even SQLite can do this using common table expressions.
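The three steps above can be sketched directly against SQLite, which supports sequence generation via a recursive CTE. This is a hand-written illustration of the proposed rewrite, not generated ibis output; the table and column names are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (d TEXT, v REAL)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("2022-01-01", 1.0), ("2022-01-03", 3.0)],
)

rows = con.execute("""
    WITH RECURSIVE idx(d) AS (             -- step 1: evenly spaced index values
        SELECT '2022-01-01'
        UNION ALL
        SELECT date(d, '+1 day') FROM idx WHERE d < '2022-01-04'
    )
    SELECT idx.d, sum(t.v)                 -- step 3: group by and aggregate
    FROM idx LEFT JOIN t ON idx.d = t.d    -- step 2: left join on the index
    GROUP BY idx.d
    ORDER BY idx.d
""").fetchall()
# dates missing from t come back with a NULL aggregate, so every bucket appears
```

Note the left join is what guarantees empty buckets still show up in the result, which is the behavior you want from a resample.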

An alternative implementation might create a new subclass of `ops.TableNode`, such as `class Resample(ops.TableNode)`, and implement the functionality in the translation layer.

This would need some exploration to determine feasibility :)

cpcloud avatar Jan 24 '22 12:01 cpcloud

Regarding the "limit sort" idea, that's interesting.

I think it could probably be done in a similar way as resampling, stopping after the join step.

cpcloud avatar Jan 24 '22 12:01 cpcloud

> Regarding the "limit sort" idea, that's interesting.
>
> I think it could probably be done in a similar way as resampling, stopping after the join step.

There are a few ways to do this. One way is to build a uniform index with a mod on the row number, which works really well if your data comes (or you expect it to come) in consistent steps, like every 15 seconds or every x seconds.
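A pure-Python sketch of that mod-over-row-number approach (the function is illustrative; in a database you would apply the same predicate to a `row_number()` column):

```python
def downsample_by_mod(rows, step):
    """Keep every `step`-th row, equivalent to rows[::step] for a list."""
    return [r for i, r in enumerate(rows) if i % step == 0]
```

This matches `[::step]` slicing exactly, but only gives an evenly spaced *sample* if the rows themselves arrive at a constant interval.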

jayceslesar avatar Jan 28 '22 18:01 jayceslesar

I have a pretty primitive solution for this, where `num_elements` is the number of rows to keep (it might be overkill to include the window column):

```python
import ibis
import numpy as np

num_rows = table.count().execute()

if num_elements > num_rows:  # no need to downsample in this case
    return table

# evenly spaced row positions to keep; row_number is 0-based, so stop at num_rows - 1
rows = np.linspace(0, num_rows - 1, num_elements).astype(int).tolist()

window = ibis.window(order_by=table[window_col])
with_index = table.mutate(index=ibis.row_number().over(window))
downsampled = with_index.filter([with_index.index.isin(rows)])
cleaned = downsampled.sort_by(downsampled[window_col]).drop(['index'])
```

I'm currently fumbling through the relations.py file to understand how best to implement something like this in its own resample/downsample class. It could also enable a really cool way to resample when window_col is a timestamp.

jayceslesar avatar May 02 '22 01:05 jayceslesar

@jayceslesar Would you mind adding that to the GitHub Wiki as a recipe? It could be helpful to others. In the meantime, I'm going to move this feature out of the 4.0.0 milestone.

cpcloud avatar Jul 07 '22 13:07 cpcloud

Closing this as stale. We're interested in this feature, but it's not currently on the near-term roadmap. A PR with a proposed API and an implementation for at least two backends would be most welcome.

cpcloud avatar Mar 20 '23 12:03 cpcloud