dplython
Allow string arguments to select, dfilter, arrange, group_by
For a lot of common use cases, column names could be passed as strings rather than properties of X:
- selecting particular columns
- filtering out missing or false values from single columns
- arranging by particular columns
- grouping by particular columns
This would slightly increase the interface complexity, but I think it would be easy (arguably easier) to read. It also is consistent with a clean (though verbose) style that uses mutate + group_by/arrange/filter rather than a "complex" group_by/arrange/filter that does an operation before grouping, arranging, or filtering.
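A rough sketch of what that could look like, next to the current Later-based spelling (the string-accepting calls are the proposal and are hypothetical; only the `X` forms work in dplython today):

```python
from dplython import diamonds, X, select, dfilter, arrange, group_by

# Current spelling: columns are referenced through the X placeholder.
diamonds >> group_by(X.cut) >> arrange(X.price) >> select(X.carat, X.price)
diamonds >> dfilter(X.carat > 1)

# Proposed spelling: plain column-name strings for the simple cases.
# (Hypothetical -- assumes the verbs are extended to accept strings.)
diamonds >> group_by("cut") >> arrange("price") >> select("carat", "price")
diamonds >> dfilter("carat")  # keep rows where carat is non-missing / truthy
```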
I fully agree for all of these, except some slight intuitive hesitation for `dfilter`. I've found myself accidentally writing strings when doing a `select` or an `arrange`. Definitely makes sense for `group_by` as well. I think my slight hesitance with `dfilter` might be that it encourages users to use the string column name. In some cases this will be more implicit than explicit, against the Zen of Python.
For example, if you have "num_points" as a column with how many points someone has, you could write this a few ways:
score_df >> dfilter(X.num_points != 0)
score_df >> dfilter(X.num_points)
score_df >> dfilter("num_points")
The first is less efficient but more explicit and easier to read. The latter two are a bit harder to immediately parse, and it seems we encourage them a bit by allowing strings.
On the other hand, it would be weird to allow strings everywhere else and not here. And I think we should definitely have strings in the other functions... so I think I'm for it.
Hmm, I'm now also wondering if there's a better verb than `dfilter` out there... if we changed that it should be sooner rather than later, so we can break everyone's code before too many people have had a chance to write a lot!
I think the second is perfectly Pythonic. In an ordinary conditional, `if points` is more idiomatic than `if points != 0`, isn't it?
I think `dfilter` should just be `filter`. This is a little weird, but you could effectively "overload" the built-in: put in a check for whether it's called with an iterable as the second argument, and if so, forward it through to the built-in `filter()`.
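If that route were taken, a minimal sketch of the shim might look like this (hypothetical; it assumes the existing verb stays available under some name, here `dfilter`, and that none of its normal arguments are plain iterables):

```python
import builtins
from collections.abc import Iterable

import pandas as pd
from dplython import dfilter  # assumed to remain the underlying verb

_builtin_filter = builtins.filter

def filter(first, *rest):
    # Called as filter(function, iterable)?  Forward to the built-in.
    if len(rest) == 1 and isinstance(rest[0], Iterable) \
            and not isinstance(rest[0], pd.DataFrame):
        return _builtin_filter(first, rest[0])
    # Otherwise treat it as the dplython verb.
    return dfilter(first, *rest)
```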
On a somewhat related note, maybe you should replace the wildcard import in the example with explicit imports: `from dplython import X, mutate, ...`. Wildcard imports are discouraged (and this also makes it less clear which functions are included in dplython). See PEP 8.
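Something along these lines (the exact export list is whatever dplython currently provides):

```python
from dplython import (DplyFrame, X, diamonds,
                      select, dfilter, arrange, mutate, group_by, summarize)
```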
When I explored creating a similar package, the implementation was string based. The strings are evaluated with the dataframe acting as the innermost namespace. This is also how ggplot does it internally with the column mappings in `aes()`. If you are not used to it, it may seem a bit weird and "magicky" at first. The key advantage is that such an implementation does away with the need for a global variable (X).
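For the record, here is roughly what that looks like with plain `eval` and a pandas DataFrame standing in (a sketch of the idea, not dplython code; pandas' own `DataFrame.eval` does something similar for arithmetic expressions):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# The expression string is evaluated with the DataFrame's columns as the
# innermost namespace, so bare names resolve to columns -- no X needed.
namespace = {col: df[col] for col in df.columns}
eval("a * 2 + b", {"__builtins__": {}}, namespace)  # -> Series [12, 24, 36]
```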
The `eval` option is definitely intriguing, assuming there aren't unavoidable performance or security implications. Maybe it could be an option, alongside functions/lambdas and Laters?
So a mutate could be done one of three ways:
mutate(twoA=X.a*2)
mutate(twoA="a*2")
mutate(twoA=lambda x: x.a*2)
Column-name string arguments to arrange, filter, etc. would simply fall out of that for free (though they could also be optimized if necessary).
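A toy version of the dispatch (a standalone sketch, not dplython internals; a real implementation would also need to check for dplython's Later type before the callable branch, since Laters themselves support calls):

```python
import pandas as pd

def mutate_any(df, **kwargs):
    # Hypothetical mutate() that accepts expression strings and callables
    # in addition to precomputed values.
    out = df.copy()
    for name, arg in kwargs.items():
        if isinstance(arg, str):       # mutate(twoA="a*2")
            cols = {c: out[c] for c in out.columns}
            out[name] = eval(arg, {"__builtins__": {}}, cols)
        elif callable(arg):            # mutate(twoA=lambda x: x.a*2)
            out[name] = arg(out)
        else:                          # plain values broadcast as usual
            out[name] = arg
    return out

df = pd.DataFrame({"a": [1, 2, 3]})
mutate_any(df, twoA="a*2", threeA=lambda x: x.a * 3)
```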
Thinking about this more, I'm worried about doing this. The main reason is that currently,
df >> mutate(twoA="a*2")
returns a dataframe with a column named `twoA`, filled with the string `"a*2"`.
Being able to add strings into your DataFrame is clearly an important feature that we need to keep. What are ways to resolve this ambiguity? None of these sound good.
- Put a function outside the string which tells dplython that the string should be evaluated, e.g. `mutate(twoA=StrEval("a*2"))` (see the sketch after this list)
- Put some format inside the string that tells dplython what to do, e.g. `mutate(twoA="EVAL: a*2")`
- Disallow the behavior of assigning strings to columns.
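For what it's worth, the first option is cheap to prototype; `StrEval` here is just a hypothetical marker class, not anything dplython provides:

```python
class StrEval(object):
    """Marks an expression string that should be eval'd against the
    DataFrame instead of being stored as a literal string column."""
    def __init__(self, expr):
        self.expr = expr

# Inside mutate(), only wrapped strings would be evaluated:
#     if isinstance(arg, StrEval):
#         value = eval(arg.expr, {"__builtins__": {}},
#                      {c: df[c] for c in df.columns})
#     else:
#         value = arg   # plain strings keep today's broadcast behavior
```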
Okay, so that's just mutate. What about the other verbs? `summarize` has the same problem as mutate. `arrange`, `dfilter`, and `group_by` are more arguable. But now we're adding a behavior in a few functions that intuitively should work in all functions. A user can arrange by `"carat"` but can't mutate using `"carat"`.
In summary, my main concerns are:
- Adds complexity without a great benefit
- Adds inconsistency: users can refer to column names as strings in arrange but can't in mutate or summarize. This seems like the bigger problem.
If the solution is `eval` based then there are two options:
- Strings are strings, so play with the quotes, i.e. `mutate(newcol="'string'")`
- Define an identity function and make it a part of the execution environment, so that `mutate(newcol="I(string)")` works.
def I(value):
    return value
The first one is free; the second can be added as well.
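Concretely, both spellings would come out of the same eval environment; this is just an illustration of the idea with a throwaway DataFrame:

```python
import pandas as pd

def I(value):
    return value

df = pd.DataFrame({"carat": [0.2, 0.3]})
namespace = {col: df[col] for col in df.columns}
namespace["I"] = I  # the identity helper becomes part of the execution environment

eval("'some text'", {"__builtins__": {}}, namespace)     # option 1: play with the quotes
eval("I('some text')", {"__builtins__": {}}, namespace)  # option 2: wrap it in I(...)
```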
While the first option would work and would be unambiguous if all queries were strings that were eval'd, I think we would still want to support a non-eval option. So we need some way to distinguish between strings meaning "column with this string repeated" and strings meaning "column whose name matches this string"/"column with expression to be eval'd".
This is tricky. How common, though, is the use case of mutating to add a column with just a single string repeated? I don't think there'd be any other ambiguous cases. Why do you think summarize has the same problem?
Maybe just say that string arguments to mutate are interpreted as column names or expressions rather than vectorized strings, and force people who want to make a "foo" column filled with "bar" to wrap it using something like `mutate(foo=repeat("bar"))`.
I don't know what the "right" style is, but in some of my dplyr code I definitely add columns that are one string. Sometimes I'll have a function output one DataFrame, generate a few DataFrames, add a str column to each one, and append them together.
My thinking has shifted to this: writing `X.carat` instead of `"carat"` is the same number of keystrokes -- the same amount of effort. Sure, it looks a little strange and feels unusual at first. But dplython uses the Later everywhere, so if you're using dplython you should use Laters. I like there being One Way To Do Things, especially if supporting multiple ways of doing things means adding in more functions or more complexity.
Fair enough, although I'll note that the string evaluation method would save keystrokes on more complex expressions like `mutate(cost="price / carat")`.
To fit with dplyr, these should be in the form `select_`, `mutate_`, `group_by_`, etc. For example:
diamonds >> select(X.carat)
vs
diamonds >> select_("carat")
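If that route is taken, the underscore variants could presumably be thin wrappers over the existing verbs; a sketch (the `select_` name mirrors dplyr, nothing like it exists in dplython yet):

```python
from dplython import X, select

def select_(*names):
    # Turn each column-name string into the equivalent X attribute lookup
    # and delegate to the existing Later-based select().
    return select(*[getattr(X, name) for name in names])

# diamonds >> select_("carat", "price")  is then the same as
# diamonds >> select(X.carat, X.price)
```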