dplyr
`mutate` superseding `transmute` should allow ordering columns
I recently noticed that transmute() has been marked as superseded by mutate(.keep = "none"). However, it turns out that mutate() doesn't replicate the column-ordering behavior of transmute(), and instead does something odd:
> data.frame(a=1, b=2) %>% transmute(a, x=b*2, b)
  a x b
1 1 4 2
> data.frame(a=1, b=2) %>% mutate(a, x=b*2, b, .keep="none")
  a b x
1 1 2 4
With more complex examples, the ordering becomes pretty confusing and difficult to explain. I'm guessing this may have to do with the .keep = "used" use case re-sorting things. For .keep = "none", explicit column ordering as given, replicating or approximating the transmute() behavior, would be much more useful (e.g. order of first LHS mention or last LHS mention).
This is one of the reasons why I still use transmute() over mutate(.keep = "none").
Two very related PRs:
- Unifying the implementation of .keep for mutate(): https://github.com/tidyverse/dplyr/pull/6035
- Which then affected transmute(), so we gave it its own separate implementation here: https://github.com/tidyverse/dplyr/issues/6086
Extremely important paragraph:
The dev behavior of .keep = "none" is overall more consistent with the rest of the mutate() options, makes it easier to predict the output when combined with .before and .after, and simplifies the implementation because it means that .keep never affects the column ordering, it is mainly about which columns get dropped (https://github.com/tidyverse/dplyr/pull/6035 goes into this in great detail).
So we should be extremely careful when considering if we want to make any changes here. #6035's big insight is that .keep should not affect the column ordering at all, and I don't think we should go back on that.
An important invariant that falls out here is that .keep plays no role in the column ordering, and I think that is valuable. I think giving .keep = "none" special behavior in a few places that changed column order is what made this so hard to get correct before.
I spent a lot of time thinking about those two PRs, and I still think the current implementation is solid theoretically, so I don't think this is a bug as much as some way to incorporate a separate idea from transmute() over into mutate().
I've refreshed myself on the logic in https://github.com/tidyverse/dplyr/pull/6035#issue-1012288146, and I am confident that the current implementation of mutate(.keep = ) with its current 4 variants is correct.
A key principle is that only 1 argument should be able to affect the output ordering. As of right now, .keep does not affect the output ordering in any way. Only .before and .after affect the output ordering, and even they are mutually exclusive because anything else is ambiguous.
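The invariant can be checked directly. What follows is a minimal sketch, assuming a dplyr version that includes the changes from the two PRs linked above. The data frame `df` is made up for illustration.

```r
library(dplyr)

df <- data.frame(a = 1, b = 2)  # illustrative data, not from the thread

# .keep only decides which columns survive: pre-existing columns stay in
# place and new columns are appended on the right.
all_cols  <- df %>% mutate(x = b * 2, .keep = "all")
none_cols <- df %>% mutate(a, x = b * 2, b, .keep = "none")
names(all_cols)
names(none_cols)

# Only .before / .after move the newly created columns.
moved <- df %>% mutate(x = b * 2, .before = a)
names(moved)
```

Both `.keep` variants should report the same column order here. Only the `.before` call should place `x` ahead of the existing columns.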
So, the only thing I think I can offer is a 5th variant of .keep, let's call it "transmute" for now for lack of a better name. If .keep = "transmute" is set, then .keep would work like "none" but would also now affect the output ordering, meaning that .before and .after would be disallowed in this one case (again, only 1 thing should be able to affect the output ordering).
I feel like .keep = "transmute" is both a bad name and a great name. Bad because it isn't super descriptive on its own, but great because it evokes the legacy idea of "transmute". And also great because we already have "none", and this is basically "none" + transmute ordering, and I can't think of a different word for that idea.
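Until something like that exists, the idea can be approximated in user code. The sketch below is not part of dplyr: `mutate_in_order()` is a hypothetical helper name, and ordering by first mention is an assumption about what "transmute ordering" should mean. It does not handle expressions such as across() that create several columns from one argument.

```r
library(dplyr)

# Hypothetical helper, not part of dplyr: keep only the columns named in `...`
# and order them by first mention.
mutate_in_order <- function(.data, ...) {
  # .named = TRUE auto-names bare columns such as `a`
  exprs <- rlang::enquos(..., .named = TRUE)
  out <- mutate(.data, !!!exprs, .keep = "none")
  # any_of() skips names that produced no column (e.g. `x = NULL`)
  select(out, any_of(unique(names(exprs))))
}

res <- data.frame(a = 1, b = 2) %>% mutate_in_order(a, x = b * 2, b)
names(res)
```

Because the reordering happens in a separate select() step, mutate() itself still follows the rule that .keep never affects column order.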
If .keep = "transmute" is simply going to replicate the previous transmute() behavior ... perhaps we can just un-supersede/un-deprecate transmute()?
As a second bit of reasoning, there's code clarity: I find (and I assume other readers of code are like me) that putting .keep = "none" at the bottom of an expression fundamentally makes reading data pipelines harder, and that seeing transmute() at the start of a step is a clear indicator that we'll be defining a 'new' table in this upcoming step.
I miss my friend transmute() :-)
If reframe() had a .size argument similar to vctrs::vec_recycle_common(), I think reframe(..., .size = n()) would be transmute().
How about adding another option: .order = c('original', 'update'), or whichever choice of words fits better here, with the default being 'original'. Maybe even c('default', 'new'). The default would maintain the order of the original frame (as is the case currently), and the alternative would update it to the new order (the way transmute() does).
Just a quick comment to say that I second the idea of either adding an .order or a .keep option to preserve the ordering output as expected from transmute.
I am currently working on a script that converts default data tables, as generated by specific hardware, into human-readable tables suitable for publication in reports. As such, column order is important. I would have used transmute(), but for long-term stability decided to go with mutate() and .keep = "none".
What is frustrating me is that columns that are carried across unchanged are staying in their original order, and any new columns (either renamed or calculated) are appended to the right in the order they are called. As a further example:
> old <- data.frame(var1 = 1:5, var2 = 6:10, var3 = 11:15, var4 = 16:20, var100 = 101:105)
> old %>% mutate(
    var1 = var1,
    var2a = var2,
    var3 = var3,
    var4 = var4,
    var5 = var4*2,
    .keep = "none"
  )
Where I expect the order of "var1", "var2a", "var3", "var4", "var5", I instead get "var1", "var3", "var4", "var2a", "var5".
Precisely. New columns are appended to the right, but since we explicitly name every column we want, we should be able to control exactly how the columns end up like transmute intuitively does.