Allow columns to "pass through" summarize? (e.g. across)

Open omrihar opened this issue 4 years ago • 7 comments

I have another question which can really optimize the way I work: often I'm performing calculations on aggregates and would like to allow some features (that are constant within the group) to pass through after the summarize. I know it's possible to create new variables in the sense of summarize(new_col=_.feature.mean(), old_col=_.old_col.iloc[0]), for example, but this gets tedious if there are many columns (or even with a few columns).

Is there a way to tell siuba (more specifically summarize) to pass through some variables? And on a related note, is there a way to apply the same operation to many columns without having to use gather? (Currently my process is gather -> group_by -> summarize -> spread whenever I want to apply one operation to many similar columns.)

Thanks for the awesome library! Omri

omrihar avatar Jun 15 '21 07:06 omrihar

Hi, I think you are doing it right with gather. To summarize multiple columns, do you mean something like dplyr's across? To my knowledge, there is no such verb in siuba yet.

essicolo avatar Jun 16 '21 20:06 essicolo

What I don't like about gather is that, given a table with several hundred thousand rows (which I often have) and about 100 to 200 columns (which I also often have), gather creates an extremely long table that I then have to group_by immediately. In my experience this takes much longer than, for example, selecting the columns I want and calling df.agg() or sometimes even df.mean(), which avoids the group operation altogether. I don't know much about dplyr's across (or much about dplyr for that matter) but, from the documentation, it seems that this is what I would like to have, yes.
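A minimal sketch of the wide approach described here, in plain pandas rather than siuba: the many value columns are aggregated in one groupby call, and columns known to be constant within each group pass through with "first". The table and the column names (group, site, m1, m2) are made up for illustration.

```python
import pandas as pd

# Toy data; "site" is assumed to be constant within each group.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "site": ["x", "x", "y", "y"],
    "m1": [1.0, 3.0, 5.0, 7.0],
    "m2": [10.0, 20.0, 30.0, 40.0],
})

value_cols = ["m1", "m2"]      # columns that get the same aggregation
passthrough_cols = ["site"]    # columns carried through unchanged

# Build one aggregation spec: mean for value columns, first for pass-through.
agg_spec = {c: "mean" for c in value_cols}
agg_spec.update({c: "first" for c in passthrough_cols})

result = df.groupby("group", as_index=False).agg(agg_spec)
print(result)
```

Because the table is never reshaped to long form, the cost stays proportional to the original number of rows, which is the point of avoiding gather here.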

omrihar avatar Jun 17 '21 06:06 omrihar

To pass through columns that have a constant value per group, as in old_col=_.old_col.iloc[0], couldn't you just include them in group_by as additional grouping variables?
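A minimal sketch of this suggestion in plain pandas (not siuba), with made-up column names: the constant column is added to the grouping keys, so it survives the aggregation without needing its own summary expression.

```python
import pandas as pd

# Toy data; "site" is assumed to be constant within each "group".
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "site": ["x", "x", "y", "y"],
    "m1": [1.0, 3.0, 5.0, 7.0],
    "m2": [10.0, 20.0, 30.0, 40.0],
})

# Grouping by both keys keeps "site" in the output. Since it is constant
# within each group, the number of groups does not change.
result = df.groupby(["group", "site"], as_index=False)[["m1", "m2"]].mean()
print(result)
```

If a supposedly constant column actually varies within a group, this silently produces extra rows, so it is worth checking the row count against a plain groupby on the real keys.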

grst avatar Jun 17 '21 07:06 grst

@grst That's a good idea. I guess sometimes I do this, but it feels like it hides the logic of the analysis. When I calculate means (say) of a group defined by some columns, including other columns (which I know are constant) might be confusing and more difficult to understand down the road, both for me and for others reading my code. If there was a direct way to simply mark some columns as pass-forward, it may make it easier to understand the meaning of the computation. On the other hand, I may simply add this more actively to my bag of tricks.

omrihar avatar Jun 17 '21 08:06 omrihar

Hey, sorry for the delay--I've been thinking about how across could be implemented. It seems like, similar to siuba's implementation of case_when(), across() could essentially take data as its first argument (verbs such as select or mutate do this too).

Here's a case_when example (since apparently it is undocumented 😬).

from siuba.data import mtcars
from siuba import _, case_when

# outputs numpy array: array(['> 4 cyl', '> 4 cyl', 'other', ...])
case_when(mtcars, {_.cyl > 4: "> 4 cyl", True: "other"})

# outputs a Symbolic expression
case_when(_, {})

# note that case_when works in SQL backends too!

Across proposal

Essentially what could happen is:

  1. across takes data as its first argument (e.g. across(_, ...), across(mtcars, ...))
     • its other args are like dplyr: column selection, functions to apply, etc.
     • it returns a DataFrame(GroupBy)
  2. across(_, _.contains('abc'), _.mean(), ...) within verbs will just get evaluated like other symbolic calls
     • down the road we can probably omit the first _
  3. functions like mutate and summarize will need to be able to handle when evaluated arguments are DataFrame(GroupBy). This would be super handy to add anyway!
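The idea in the steps above can be roughly approximated in plain pandas. The helper below, across_sketch, is hypothetical and is not part of siuba; it only illustrates a function that takes data (a DataFrame or a grouped DataFrame) as its first argument, applies one function to the selected columns, and returns a DataFrame. The toy data stands in for mtcars.

```python
import pandas as pd

def across_sketch(data, columns, func):
    """Hypothetical stand-in for the proposed across(); not siuba API.

    data: a DataFrame or a DataFrameGroupBy
    columns: list of column names to operate on
    func: anything accepted by .agg(), e.g. "mean"
    """
    out = data[list(columns)].agg(func)
    # An ungrouped reduction returns a Series; turn it into a one-row frame
    # so the result is always a DataFrame.
    if isinstance(out, pd.Series):
        out = out.to_frame().T
    return out

# Small made-up table with mtcars-like column names.
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6],
    "mpg": [30.0, 26.0, 21.0, 19.0],
    "hp": [60.0, 100.0, 110.0, 130.0],
})

overall = across_sketch(cars, ["mpg", "hp"], "mean")
by_cyl = across_sketch(cars.groupby("cyl"), ["mpg", "hp"], "mean")
print(overall)
print(by_cyl)
```

Returning a DataFrame in both cases is what would let a verb like summarize splice the result in as several output columns, which is the third step of the proposal.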

Examples

from siuba.data import mtcars
from siuba import _, across, mutate, summarize

# all the classic selection options
across(mtcars, [_.mpg, _.hp], _.mean())
across(mtcars, _.contains("mpg"), _.mean())
across(_, _[3:5], _.mean())

# with summarize
mtcars >> summarize(across(_, [_.mpg, _.hp], _.mean()))

# an implication of summarize accepting DataFrames as arguments is
# that you can do this.
mtcars >> summarize(_[["mpg", "hp"]].mean())

# or this, which would override mpg and hp to be + 1
mtcars >> mutate(_[["mpg", "hp"]] + 1)

# here is analogous dplyr code (R, not Python). TODO, is the overriding
# behavior mentioned explicitly in dplyr docs?
# mutate(mtcars, mtcars[c("hp", "mpg")] + 1)

machow avatar Jun 21 '21 02:06 machow

Sorry for the delayed reply as well (I sometimes forget to check GitHub notifications 😅)

This seems like a really nice solution and I'd love to give it a go once you have a working implementation of it :)

omrihar avatar Jul 08 '21 19:07 omrihar

Does siuba have an across verb?

moeketsims avatar Dec 06 '21 19:12 moeketsims