dplython
Add transmute functionality into select
transmute and select do very similar things: each creates a smaller dataframe made up of a few columns derived from the current one. The difference between transmute and select is small. transmute creates new columns, so it is basically a mutate followed by a select. select only uses existing columns. Why not put all of this functionality inside select?
diamonds >> select(X.carat * 2, X.color, chair=X.table) >> head()
# Out:
# X["carat"] * 2 color chair
# 8.01 I1 61
# 8.01 I1 62
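To make the "mutate and then a select" description concrete, here is a minimal sketch in plain Python. It assumes a toy dataframe modeled as a dict of column lists. The helper names (including unified_select) and the sample values are hypothetical and are not dplython's API; the sketch only illustrates the semantics being proposed.

```python
# Toy model: a "dataframe" is a dict of column name -> list of values.
# These helpers are illustrative only and are not dplython's API.

def _rows(df):
    """Yield each row of df as a dict of column name -> value."""
    n_rows = len(next(iter(df.values()), []))
    for i in range(n_rows):
        yield {name: col[i] for name, col in df.items()}

def mutate(df, **derived):
    """Keep every existing column and append the derived ones."""
    rows = list(_rows(df))
    out = dict(df)
    for name, fn in derived.items():
        out[name] = [fn(row) for row in rows]
    return out

def select(df, *names):
    """Keep only the named existing columns."""
    return {name: df[name] for name in names}

def transmute(df, **derived):
    """mutate followed by select on just the derived columns."""
    return select(mutate(df, **derived), *derived)

def unified_select(df, *names, **derived):
    """The proposal: one verb that accepts existing and derived columns."""
    return select(mutate(df, **derived), *names, *derived)

# Illustrative sample values, not real output.
diamonds = {"carat": [0.23, 0.21], "color": ["E", "E"], "table": [55.0, 61.0]}
result = unified_select(
    diamonds,
    "color",
    double_carat=lambda row: row["carat"] * 2,
    chair=lambda row: row["table"],
)
print(list(result))  # ['color', 'double_carat', 'chair']
```

In this toy version, transmute is just unified_select called with no positional names, which is the sense in which one verb could cover both.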
One argument against this is that dplyr uses - to indicate "drop this", so diamonds %>% select(-carat) drops the carat column. That seems a little strange here, and it departs from SQL syntax, where a user might expect it to return the negated values of carat. To keep this functionality, we could make a new drop verb which drops rows.
+1. I agree that using - to drop columns in select is a little weird anyway. I might prefer a special function like except that works within select (much like matches and contains in dplyr), rather than adding a whole new verb.
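A rough sketch of how such a helper might behave inside select, again on a toy dict-of-lists dataframe. Everything here is hypothetical: the Except marker class and the except_ name are assumptions made for illustration (except itself is a reserved word in Python, so a real helper would need a different spelling).

```python
# Illustrative only: a marker object that select interprets as
# "every column other than these". Not dplython's API.

class Except:
    """Marker meaning 'all columns except the named ones'."""
    def __init__(self, *names):
        self.names = set(names)

def except_(*names):
    # Trailing underscore because `except` is a reserved word in Python.
    return Except(*names)

def select(df, *args):
    """Keep named columns; expand Except markers against df's columns."""
    keep = []
    for arg in args:
        if isinstance(arg, Except):
            candidates = [name for name in df if name not in arg.names]
        else:
            candidates = [arg]
        for name in candidates:
            if name not in keep:
                keep.append(name)
    return {name: df[name] for name in keep}

# Illustrative sample values.
diamonds = {"carat": [0.23, 0.21], "color": ["E", "E"], "table": [55.0, 61.0]}
without_carat = select(diamonds, except_("carat"))
print(list(without_carat))  # ['color', 'table']
```

Because the exclusion lives in a helper rather than in the - operator, select(-X.carat) would stay free to mean the negated values of carat.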
I want to hear input from @dgrtwo on this as well. We should make sure we wouldn't be preventing any other special select functionality. In particular, pattern-matching column names needs to remain simple. But I think there's an opportunity here for a unified interface for pattern-matching of column names, not just in select, but also in mutate and summarize (instead of dplyr's mutate_each and summarize_each).
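One way to picture a unified interface is a single resolver that turns selectors into column names, shared by every verb. The sketch below is an assumption-laden toy on a dict-of-lists dataframe: resolve and mutate_columns are hypothetical names, and matches and contains only borrow the dplyr helper names without claiming to reproduce their behavior.

```python
import re

# Illustrative only: selectors are either a plain column name or a
# predicate over column names. Not dplython's or dplyr's API.

def matches(pattern):
    """Selector for column names matching a regular expression."""
    regex = re.compile(pattern)
    return lambda name: regex.search(name) is not None

def contains(text):
    """Selector for column names containing a substring."""
    return lambda name: text in name

def resolve(df, *selectors):
    """Turn a mix of names and predicates into an ordered list of names."""
    picked = []
    for selector in selectors:
        if isinstance(selector, str):
            names = [selector]
        else:
            names = [name for name in df if selector(name)]
        for name in names:
            if name not in picked:
                picked.append(name)
    return picked

def select(df, *selectors):
    """Keep only the columns the selectors resolve to."""
    return {name: df[name] for name in resolve(df, *selectors)}

def mutate_columns(df, fn, *selectors):
    """Apply fn to every value of each resolved column."""
    out = dict(df)
    for name in resolve(df, *selectors):
        out[name] = [fn(value) for value in df[name]]
    return out

# Illustrative sample values.
diamonds = {"carat": [0.5], "x": [1.0], "y": [2.0], "z": [3.0], "table": [55.0]}
print(list(select(diamonds, matches("^[xyz]$"), "carat")))  # ['x', 'y', 'z', 'carat']
doubled = mutate_columns(diamonds, lambda v: v * 2, contains("a"))
print(doubled["table"])  # [110.0]
```

Since both verbs call the same resolve, any new selector helper would work in select and in the mutate-style verb without separate _each variants.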
In your last sentence, I assume you mean "a new drop verb which drops columns."
Minor annoying thing here... If I do diamonds >> transmute(X.color), the resulting column will be named X.color. If I do diamonds >> select(X.color), the resulting column will be named color.
I basically want to replace select with transmute. So which behavior makes more sense?
I think if we keep the current select-style, then we should drop all X. from the string rep of Laters, for consistency's sake. If we keep the transmute style, well, then it's terrible, because as soon as you do a select your column names are all screwed up, with X getting added to them all the time.
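The two naming behaviors can be compared with a small stand-in for a deferred expression. This is only a sketch: the Later class below is a simplified, hypothetical model (its string form is an assumption and may differ from dplython's actual representation), and column_name is an invented helper.

```python
import re

# Illustrative only: a minimal stand-in for a deferred expression that
# records how it was built. Not dplython's Later implementation.

class Later:
    def __init__(self, text):
        self._text = text

    def __getattr__(self, name):
        # Only called for attributes not found normally, e.g. X.color.
        if name.startswith("_"):
            raise AttributeError(name)
        return Later(f"{self._text}.{name}")

    def __mul__(self, other):
        return Later(f"{self._text} * {other!r}")

    def __str__(self):
        return self._text

X = Later("X")

def column_name(expr, strip_prefix):
    """Name a result column from an expression's string form."""
    text = str(expr)
    return re.sub(r"\bX\.", "", text) if strip_prefix else text

print(column_name(X.color, strip_prefix=False))     # X.color
print(column_name(X.color, strip_prefix=True))      # color
print(column_name(X.carat * 2, strip_prefix=True))  # carat * 2
```

With the prefix stripped, a bare column reference keeps its original name, so repeated select calls would not accumulate X. in the column names.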
I'd like to hear community opinions, but probably we should just get rid of X in the string. It's also closer to dplyr results.