pandas2 icon indicating copy to clipboard operation
pandas2 copied to clipboard

"Predicate pushdown" in group-bys

Open wesm opened this issue 8 years ago • 2 comments

xref #15

I brought this up at SciPy 2015, but there's a significant performance win available in expressions like:

df[boolean_cond].groupby(grouping_exprs).agg(agg_expr)

If you do this currently, it will produce a fully materialized copy of df even if the groupby only touches a small portion of the DataFrame. Ideally, we'd have:

df.groupby(grouping_exprs, where=boolean_cond).agg(...)

I put this as a design / pandas2 issue because the boolean bytes / bits will need to get pushed down into the various C-level groupby subroutines.

wesm avatar Aug 31 '16 03:08 wesm

On the mailing list, you mentioned the idea of an "expression VM", this feels like the kind of thing that would be nicely handled by that? Just making up an API, something like this, where a delayed df builds up a dask/numexpr like graph that can be optimized.

df = pd.read_csv(...)
with pd.delayed(df) as df:
   df['val'] = df['val'] + 100.
   <... several intermediate expressions ... >
   answer = df[cond].groupby(expr).agg(...).compute()

# `df` is unmodified, only `answer` is computed, hopefully very efficiently 

Although that's really broad so maybe this is a useful enough case to just build directly into groupby ops.

chris-b1 avatar Sep 01 '16 02:09 chris-b1

Yeah, the idea behind an "expression VM" is similar to the design of APL interpreters. This is a bigger topic than this issue, but normal pandas operations would be implemented through the eager evaluation of operators in pandas's internal set of functions. Once you have multiple operators you can begin to think about optimizing the evaluation or rearranging the query plan. SFrame (RIP?) notably does this

https://github.com/turi-code/SFrame/tree/master/oss_src/sframe_query_engine

wesm avatar Sep 01 '16 03:09 wesm