dplyr summarize() with multi-row returns

Open krlmlr opened this issue 1 year ago • 2 comments

As of dplyr 1.0.0, summarize() will create multiple rows per group, according to the length of the return value of the summary function. This new feature leads to unintended behavior if the vector return is accidental, and also can lead to data loss.

library(conflicted)
library(dplyr)

my_custom_summary_function <- function(n) {
  # Should return a scalar, but I accidentally return a vector
  rep(n, n)
}

tibble(n = 2:0) %>% 
  group_by(n) %>% 
  summarize(out = my_custom_summary_function(n), .groups = "drop") %>% 
  ungroup()
#> # A tibble: 3 × 2
#>       n   out
#>   <int> <int>
#> 1     1     1
#> 2     2     2
#> 3     2     2

^{Created on 2022-08-01 by the reprex package (v2.0.1)}

Should we introduce a .multi = c("allow", "require", "fail") argument that supports the pre-1.0.0 strict mode of operation? Should .multi = "fail" even be the default?

library(conflicted)
library(dplyr)

my_custom_summary_function <- function(n) {
  # Should return a scalar, but I accidentally return a vector
  rep(n, n)
}

tibble(n = 2:0) %>% 
  group_by(n) %>% 
  summarize(out = my_custom_summary_function(n), .groups = "drop", .multi = "fail") %>% 
  ungroup()
## Error: `out` has length != 1 in groups 1, 3, use `.multi = "allow"` if this is intended

^{Imagined on 2022-08-01 by the reprex package (v2.0.1)}
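
Until something like `.multi` exists, a similar check can be approximated in user code. The following is a minimal sketch, assuming a hypothetical helper `assert_scalar()` (not part of dplyr) that wraps the summary expression and errors when a group's result is not length 1:

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): error unless `x` has length 1
assert_scalar <- function(x) {
  if (length(x) != 1L) {
    stop("summary has length ", length(x), ", expected 1", call. = FALSE)
  }
  x
}

my_custom_summary_function <- function(n) {
  # Should return a scalar, but accidentally returns a vector
  rep(n, n)
}

# The groups n = 0 and n = 2 produce results of length 0 and 2,
# so the pipeline errors instead of silently dropping or adding rows
res <- tryCatch(
  tibble(n = 2:0) %>%
    group_by(n) %>%
    summarize(out = assert_scalar(my_custom_summary_function(n)), .groups = "drop"),
  error = function(e) "failed"
)
```

Here `res` ends up as the string `"failed"` rather than a tibble, because the check fires inside `summarize()`. The drawback is that every summary expression has to be wrapped by hand, which is what an argument on the verb itself would avoid.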

Aug 01 '22 06:08 krlmlr

I'm not too worried about this; in general any misspecified summary function could corrupt data.

Aug 01 '22 14:08 hadley

See also https://twitter.com/drob/status/1563198515626770432?s=20&t=iTFWSCPNOGWalIrpXHx2qg

Aug 26 '22 17:08 DavisVaughan

I have occasionally written multi-row summarize pipelines intentionally, but when I do that I need to very carefully document what that code is doing. When I teach summarize, the working model I use is "one row per group"; otherwise it gets confused with mutate. Yes, any bugs in summary functions could corrupt data, but this behavior should be opt-in because (1) R's recycling rules make this sort of behavior easy to trigger accidentally, (2) it's hard to notice and then diagnose when it does happen, and (3) since one-row-per-group is the expected behavior, code that does something different should look like it's doing something different.

Nov 14 '22 17:11 kcarnold

Our plan is to deprecate this behaviour in summarise() and instead introduce a new function specifically for this purpose. We just need a name. Ideas so far:

morph()
transmogrify()
multisummarise()
abridge()
remodel()
remould()
renovate()
revamp()
shorten()
contract()
lessen()
condense()
synopsize()

I think ideally the word would be closer to summarise() than mutate(), i.e. starting later in the alphabet or ending in ise (although then we'd need UK/US variants, which isn't ideal). I think it's ok if the verb implies an unconditional shrinking, even though some uses might increase the number of rows; we also say that [ subsets a vector.

Nov 17 '22 15:11 hadley

I like morph() the most out of all of these. It implies some kind of stretching/shrinking of the data without implying any direction. And it is fairly short.

Nov 17 '22 16:11 DavisVaughan

I like morph() too. What will happen with morph(<grouped_df>)?

Are we considering being specific about shrinking/growing? i.e. we could have shrink() and grow() or something.

Nov 18 '22 05:11 romainfrancois

morph(<grouped_df>) would have to work like summarise(<grouped_df>) currently works, I think, i.e. each group computation can return any number of rows, and we recycle the per-group results "rowwise" across the resulting columns. And we'd add .by support for morph().

I'm somewhat confident we don't need to care about the direction, mainly it's:

summarise() has the guarantee of 1 row per group. More predictable for users. Harder to make a mistake. Easier database translations.
morph() just relaxes that guarantee, but otherwise works similarly. But when you see morph() in code it should be a clear signal that something is happening that isn't a pure summary, which is pretty nice.
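
To make the "any number of rows per group" semantics concrete, here is a minimal base R sketch. It is not dplyr's implementation, only an illustration of the idea: compute one result per group, then repeat each group key to match the length of that group's result.

```r
df <- data.frame(
  g = c(1, 1, 2, 2, 2),
  x = c(4, 5, 1, 2, 3)
)

# One result per group; range() returns 2 values, so each group yields 2 rows
per_group <- lapply(split(df$x, df$g), range)

# Repeat each group key once per returned value, then bind everything together
out <- data.frame(
  g = rep(as.numeric(names(per_group)), lengths(per_group)),
  x = unlist(per_group, use.names = FALSE)
)
```

With a function that always returns a single value, such as mean(), the same code yields exactly one row per group, which is the sense in which summarise() is the restricted case.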

Nov 18 '22 14:11 DavisVaughan

I'm remembering that tidygraph uses morph(), which might be enough to prevent us from using it.

I also thought of restructure(), which is kind of nice because it is closer to summarise() in the alphabet and the core part of each verb starts with s (structure and summarise). And it seems to nicely convey that you are taking an existing data frame and reworking it into some new form (with little restriction on the number of rows or columns). The only potential problem is possible confusion with reshape(), but I think I'm ok with it.

It seems somewhat reasonable to say that summarise() is a restricted version / special case of restructure().

Nov 19 '22 23:11 DavisVaughan

This operation is sometimes called "split-apply-combine", so perhaps recombine, rebuild, reconstruct, remake, or reform? After all, we're making an entirely new data frame by combining the results of operations on each group.

Or, more related to existing verbs, something based on bind_rows? bind_rows_groupwise? tibble_groupwise?

Nov 20 '22 01:11 kcarnold

Or just build(), i.e. "build a new data frame from an existing one". If we aren't worried about conflicting with devtools::build(), that sounds pretty good.

Nov 20 '22 13:11 DavisVaughan

Another build synonym would be assemble().

In the building/construction metaphor: renovate()

Crazy idea: this function is a sort of combination of mutate() and summarise() so we could call it summate(), which means to sum up.

Nov 22 '22 13:11 hadley

I think I'd be fairly happy with assemble()

It doesn't immediately come up as being used by any big packages
Still no direction implied in the name, which I like
I like this idea of using the name to reflect that this "creates a new data frame", which we have always described summarise() as theoretically doing
I like that it doesn't start with re*()

Nov 22 '22 14:11 DavisVaughan

I like the crazy idea (summate()) because it explains what it does (relax the size constraints of mutate and summarise so it can be anything in between) without really introducing a new verb (it's a portmanteau).

Among the other suggestions I prefer morph() for the same reason, because of this idea of an unconstrained form of the result.

Nov 22 '22 14:11 lionel-

Since you can also expand the rows, I think summate is not such a good name after all.

Maybe a verb like remodel() would be a good way of expressing the change in shape.

Nov 22 '22 15:11 lionel-

In another direction, verbs that imply recreating a data frame:

retibble()
reframe()
redefine()

Relationship between reframe and tibble frame functions:

enframe: vector → df
deframe: df → vector
reframe: df → df
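
For reference, the two existing tibble functions in that list behave as follows. This is a small sketch using only `tibble::enframe()` and `tibble::deframe()`; it does not show the proposed verb itself.

```r
library(tibble)

v <- c(a = 1, b = 2)

# enframe(): named vector -> two-column data frame (`name`, `value`)
df <- enframe(v)

# deframe(): two-column data frame -> named vector
back <- deframe(df)
```

The round trip recovers the original names and values, so a reframe() would complete the set as the data frame to data frame member.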

Nov 22 '22 15:11 lionel-

FWIW, as a long time dplyr user I'm not hugely keen on morph() - in my mind it doesn't feel suggestive of summarise()-like behaviour. Of all the suggestions so far I like multisummarise() best, but I feel like there's a better counterpoint out there. Some extra suggestions:

elaborate()
abbreviate()
telescope()
restate()
revise()

Nov 24 '22 22:11 wurli

"it doesn't feel suggestive of summarise()-like behaviour."

I somewhat strongly believe that you should not try to connect this new verb to summarise() too closely in your head:

summarise(): reduce each group down to 1 row
new verb: "do something" to each group

It just happens to be that summarise() is a "special case" of this new verb, but in terms of daily practical usage that is as far as I'd take the comparison.

Real-life usage of this new verb typically looks awkward if summarise() is in the name, because it very often isn't actually performing any kind of summary operation.

Nov 25 '22 11:11 DavisVaughan

Some real-world examples would help in picking the name. (I thought I'd had some, but in a quick look through my stuff, I only found examples of where the multi-row behavior wasn't what I'd wanted.)

Nov 26 '22 14:11 kcarnold

Throwing another name into the hat, because I like short names, I'll suggest draw() as in either (take your pick)

to "draw" out specific data from a tibble, as in "draw water from a well"
to "draw" a new tibble from an existing one, as in "draw a picture"

Nov 26 '22 17:11 eutwt

A few real life examples.

With ivs, which generally takes sets of intervals and returns other sets of arbitrary size (notably, it can return more or fewer rows than you started with!)

library(dplyr)
library(ivs)

df <- tibble(
  start = as.Date(c("2019-01-01", "2019-01-04", "2019-01-07")),
  end = as.Date(c("2019-01-05", "2019-01-06", "2019-01-08"))
) %>%
  mutate(iv = iv(start, end), .keep = "none")

df
#> # A tibble: 3 × 1
#>                         iv
#>                 <iv<date>>
#> 1 [2019-01-01, 2019-01-05)
#> 2 [2019-01-04, 2019-01-06)
#> 3 [2019-01-07, 2019-01-08)

# Merge all overlapping ranges
df %>%
  morph(iv = iv_groups(iv))
#> # A tibble: 2 × 1
#>                         iv
#>                 <iv<date>>
#> 1 [2019-01-01, 2019-01-06)
#> 2 [2019-01-07, 2019-01-08)

# Split all overlapping ranges into non-overlapping disjoint sets
df %>%
  morph(iv = iv_splits(iv))
#> # A tibble: 4 × 1
#>                         iv
#>                 <iv<date>>
#> 1 [2019-01-01, 2019-01-04)
#> 2 [2019-01-04, 2019-01-05)
#> 3 [2019-01-05, 2019-01-06)
#> 4 [2019-01-07, 2019-01-08)

Similar idea with intersect():

library(dplyr, warn.conflicts = FALSE)

table <- c("a", "b", "d", "f")

df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)

# `morph()` allows you to apply functions that return
# an arbitrary number of rows
df %>%
  morph(x = intersect(x, table))
#> # A tibble: 4 × 1
#>   x    
#>   <chr>
#> 1 a    
#> 2 b    
#> 3 f    
#> 4 d

Doing something silly like reproducing slice_sample()

library(dplyr)
df <- tibble(
  g = c(1, 1, 2, 2, 2),
  x = c(4, 5, 1, 2, 3)
)
df %>%
  morph(x = sample(x, 4, replace = TRUE), .by = g)
#> # A tibble: 8 × 2
#>       g     x
#>   <dbl> <dbl>
#> 1     1     4
#> 2     1     5
#> 3     1     4
#> 4     1     4
#> 5     2     2
#> 6     2     3
#> 7     2     2
#> 8     2     3

An older pattern combining read_csv() with multiple files, from the original dplyr 1.0.0 blog post about this feature:

tibble(path = dir(pattern = "\\.csv$")) %>% 
  rowwise(path) %>% 
  morph(read_csv(path))

Nov 27 '22 14:11 DavisVaughan

How about create()?

Because you "create a new result from each group" (this would be the help page title)
Can also be seen as "create a new data frame from an existing one"
- Which ties to our theoretical beliefs that this and summarise() create a "new" data frame, as opposed to mutate()
Easy to tie to summarise(), because that "creates a 1 row summary from each group". So it is a special case of this.
Does not imply a direction
Does not imply a number of rows returned
Does not seem to be taken by any packages
The name works very well with all of my real life examples above, even the read_csv() one

With the ivs example, I would say that I "create the groups by merging the overlapping ranges", and the code is create(groups = iv_groups(iv)).

So far this is my favorite option

Subjective reasons I like it:

Has an artistic flair to it. "Creation" has fewer rules tied to it, e.g. rules about the number of rows returned
Fairly short name
It is a name with positive connotations
Feels along the same lines as mutate() and summarise()

Nov 27 '22 14:11 DavisVaughan

I like it. It seems a bit too general to me though, compared to something like reframe() which is a more practical description of what is happening. But I agree that it feels more similar to mutate and summarise.

Nov 28 '22 08:11 lionel-

I like create() and believe it would read very well with .by =

Nov 28 '22 09:11 romainfrancois

I also like create() a lot but agree that it possibly feels overly general

Nov 28 '22 09:11 wurli

When you consider the family it's not immediately obvious why create() is called like that because all the verbs are an act of creation:

mutate() creates new columns or recreates existing columns within an existing data frame.
summarise() creates a new data frame with size-1 summaries from an existing one.
create() creates a new data frame from an existing one.

I think this illustrates why create() is too general.

Nov 28 '22 10:11 lionel-

I actually liked create() because it was fairly general 😆

The fact that you can describe mutate() and summarise() using the word "create" didn't bother me too much, since their names imply they are stricter variants of it. create() is just an act of creation with the fewest restraints possible

Nov 28 '22 13:11 DavisVaughan

I think create() feels a little strange because the object of the verb (as you'd use it in normal speech) is the output instead of the input. Like, you summarise()/mutate() an existing data frame but you create() a new data frame. That being said, I think it does seem to work better than the other suggestions (incl. mine above)

Nov 28 '22 13:11 eutwt

I like create(), but I'm afraid it sounds too magical. In my understanding, the function is rather for experts compared to mutate() and summarize() with single-row-returns, so probably it should sound more difficult.

What about explode(), which is used in Hive/Spark SQL? c.f. https://spark.apache.org/docs/latest/api/sql/index.html#explode

Nov 28 '22 15:11 yutannihilation

I've been thinking about this for a few days but I haven't come up with a new name that I like, and I'm afraid I don't think any of the ones suggested here sit right with me. At the risk of being very not creative, I would suggest something like multi_summarize() or summarize_multi().

Otherwise reframe() makes the most sense but I think it won't be trivial to teach when to use reframe() vs. summarize(), as in, how will someone know they should use summarize() instead of reframe()? (Though this is maybe more of a comment on the function's functionality than its name.)

Nov 28 '22 16:11 mine-cetinkaya-rundel

Maybe it's only me, but I am not completely convinced that we need a completely new function here. I actually liked @krlmlr's initial suggestion of having a separate argument .multi in summarise() that can define the behaviour. I can't find the discussion of why that idea was rejected.

Nov 29 '22 07:11 shahronak47
