feat(api): pythonic downsampling + limits
Being able to downsample in a pythonic way would be pretty neat. I sometimes write queries on time series data that spans billions of rows. Currently, to downsample, I take a mod over a row-number index I create, but it would be nice to be able to slice with the `[::step_size]` syntax used for iterables.
Another interesting idea is a limit that works sort of like `np.linspace`: specifying a limit of 1000 would give you back 1000 rows evenly spaced across some index (which must be sorted).
@jayceslesar Thanks for making an issue about this.
Indeed, many of the past and current maintainers have worked or currently work extensively with time series, and it's an area of the library that is ripe for improvement and expansion!
As you've seen, we have `asof_join` and basic timestamp/date/time operations, but tools like resampling don't exist yet.
My hunch is that resampling can be implemented as an expression rewrite:
- construct a sequence of evenly spaced "index" values, such as dates from `a` to `b`, as an intermediate table expression (call that `index_t` and the dates column `dates`)
- left join `index_t` and the original table (`t`) on `index_t.dates = t.dates`
- group by `dates` and aggregate as desired
The main challenge I see here is the ability of a database to generate a sequence of timestamps/dates/times with a given step, though I think even SQLite can do this using common table expressions.
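To make the rewrite concrete, here's a rough user-land sketch. The `resample` helper and the column names `ts`/`value` are purely illustrative, and it assumes a backend that supports `ibis.memtable`:

```python
import ibis
import pandas as pd

# Hypothetical sketch of the rewrite above, not a real ibis API.
# Assumes a table with a timestamp column `ts` and a numeric column `value`.
def resample(t, freq="15min"):
    start, stop = t.ts.min().execute(), t.ts.max().execute()
    # evenly spaced "index" values from start to stop as an intermediate table
    index_t = ibis.memtable({"dates": pd.date_range(start, stop, freq=freq)})
    # left join index_t and the original table on the timestamp column
    joined = index_t.left_join(t, index_t.dates == t.ts)
    # group by the index dates and aggregate as desired
    return joined.group_by("dates").aggregate(avg_value=joined.value.mean())
```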
An alternative implementation might create a new subclass of `ops.TableNode`, such as `class Resample(ops.TableNode)`, and implement the functionality in the translation layer.
This would need some exploration to determine feasibility :)
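Purely to illustrate the shape (real ibis nodes declare their inputs via internal validation rules, which this sketch elides; none of this is the actual internal API):

```python
import ibis.expr.operations as ops

# Illustrative only: a node carrying (parent table, time column, frequency),
# which each backend would translate in its compilation layer.
class Resample(ops.TableNode):
    """Hypothetical resample node; field declarations omitted."""
```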
Regarding the "limit sort" idea, that's interesting.
I think it could probably be done in a similar way to resampling, stopping after the join step.
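Roughly, reusing the sketch above but keeping the joined rows instead of aggregating (again, the helper and column names are illustrative):

```python
import ibis
import pandas as pd

# hypothetical "limit" of n evenly spaced rows: join against n index values
def limit_evenly(t, n=1000):
    start, stop = t.ts.min().execute(), t.ts.max().execute()
    index_t = ibis.memtable({"dates": pd.date_range(start, stop, periods=n)})
    return index_t.left_join(t, index_t.dates == t.ts)  # stop after the join
```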
There are a few ways to do this. One way is to build a uniform index with some mod on the row number, which works really well if your data comes (or you expect it to come) in consistent steps, like every 15 seconds or every x seconds.
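For reference, the mod approach looks roughly like this (a sketch, not tested; `every_nth` is just a name I'm making up, and `window_col` is the column to order by):

```python
import ibis

# sketch of the mod approach: keep every `step`-th row, like table[::step]
def every_nth(table, window_col, step):
    window = ibis.window(order_by=table[window_col])
    indexed = table.mutate(idx=ibis.row_number().over(window))
    return indexed.filter(indexed.idx % step == 0).drop(['idx'])
```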
I have a pretty primitive solution for this, where `num_elements` is the number of rows to keep (it might be overkill to include the window column):
```python
import ibis
import numpy as np

def downsample(table, window_col, num_elements):
    num_rows = table.count().execute()
    if num_elements >= num_rows:  # no need to downsample in this case
        return table
    # num_elements row indices spaced evenly across the table;
    # row_number() is 0-based, so the last valid index is num_rows - 1
    rows = np.linspace(0, num_rows - 1, num_elements).astype(int).tolist()
    window = ibis.window(order_by=table[window_col])
    with_index = table.mutate(index=ibis.row_number().over(window))
    downsampled = with_index.filter([with_index.index.isin(rows)])
    return downsampled.sort_by(downsampled[window_col]).drop(['index'])
```
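For example (the column name `ts` is just illustrative):

```python
# thin a billion-row table down to 1000 evenly spaced rows by timestamp
preview = downsample(table, window_col='ts', num_elements=1000).execute()
```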
Currently I'm fumbling through the `relations.py` file to work out how best to implement something like this in its own resample/downsample class. It could also enable a really cool way to resample, given that `window_col` is a timestamp.
@jayceslesar Would you mind adding that to the GitHub Wiki as a recipe? It could be helpful to others. In the meantime, I'm going to move this feature out of the 4.0.0 milestone.
Closing this as stale. We're interested in this feature, but it's not currently on the near-term roadmap. A PR with a proposed API and an implementation with at least two backends would be most welcome.