dplython
Allow string arguments to select, dfilter, arrange, group_by
For a lot of common use cases, column names could be passed as strings rather than properties of X:
- selecting particular columns
- filtering out missing or false values from single columns
- arranging by particular columns
- grouping by particular columns
This would slightly increase the interface complexity, but I think it would be easy (arguably easier) to read. It also is consistent with a clean (though verbose) style that uses mutate + group_by/arrange/filter rather than a "complex" group_by/arrange/filter that does an operation before grouping, arranging, or filtering.
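A rough sketch of what that could look like, next to the current Later-based spelling (the string-accepting calls are the proposal and are hypothetical; only the `X` forms work in dplython today):

```python
from dplython import diamonds, X, select, dfilter, arrange, group_by

# Current spelling: columns are referenced through the X placeholder.
diamonds >> group_by(X.cut) >> arrange(X.price) >> select(X.carat, X.price)
diamonds >> dfilter(X.carat > 1)

# Proposed spelling: plain column-name strings for the simple cases.
# (Hypothetical -- assumes the verbs are extended to accept strings.)
diamonds >> group_by("cut") >> arrange("price") >> select("carat", "price")
diamonds >> dfilter("carat")  # keep rows where carat is non-missing / truthy
```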
I fully agree for all of these, except some slight intuitive hesitation for `dfilter`. I've found myself accidentally writing strings when doing a `select` or an `arrange`. Definitely makes sense for `group_by` as well. I think my slight hesitance with `dfilter` might be that it encourages users to use the string column name. In some cases this will be more implicit than explicit, against the Zen of Python.
For example, if you have "num_points" as a column with how many points someone has, you could write this a few ways:
score_df >> dfilter(X.num_points != 0)
score_df >> dfilter(X.num_points)
score_df >> dfilter("num_points")
The first is less efficient but more explicit and easier to read. The latter two are a bit harder to immediately parse, and it seems we encourage them a bit by allowing strings.
On the other hand, it would be weird to allow strings everywhere else and not here. And I think we should definitely have strings in the other functions... so I think I'm for it.
Hmm, I'm now also wondering if there's a better verb than `dfilter` out there... if we changed that it should be sooner rather than later, so we can break everyone's code before too many people have had a chance to write a lot!
I think the second is perfectly Pythonic. In an ordinary conditional, `if points` is more idiomatic than `if points != 0`, isn't it?
I think `dfilter` should just be `filter`. This is a little weird, but you could effectively "overload" the built-in: put in a check for whether it's called with an iterable as the second argument, and if so, forward it through to the built-in `filter()`.
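If that route were taken, a minimal sketch of the shim might look like this (hypothetical; it assumes the existing verb stays available under some name, here `dfilter`, and that none of its normal arguments are plain iterables):

```python
import builtins
from collections.abc import Iterable

import pandas as pd
from dplython import dfilter  # assumed to remain the underlying verb

_builtin_filter = builtins.filter

def filter(first, *rest):
    # Called as filter(function, iterable)?  Forward to the built-in.
    if len(rest) == 1 and isinstance(rest[0], Iterable) \
            and not isinstance(rest[0], pd.DataFrame):
        return _builtin_filter(first, rest[0])
    # Otherwise treat it as the dplython verb.
    return dfilter(first, *rest)
```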
On a somewhat related note, maybe you should replace the wildcard import in the example with explicit imports: `from dplython import X, mutate, ...`. Wildcard imports are discouraged (and this also makes it less clear which functions are included in dplython). See PEP 8.
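Something along these lines (the exact export list is whatever dplython currently provides):

```python
from dplython import (DplyFrame, X, diamonds,
                      select, dfilter, arrange, mutate, group_by, summarize)
```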
When I explored creating a similar package, the implementation was string based. The strings are evaluated with the dataframe acting as the innermost namespace. This is also how ggplot does it internally with the column mappings in `aes()`. If you are not used to it, it may seem a bit weird and "magicky" at first. The key advantage is that such an implementation does away with the need for a global variable (X).
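For the record, here is roughly what that looks like with plain `eval` and a pandas DataFrame standing in (a sketch of the idea, not dplython code; pandas' own `DataFrame.eval` does something similar for arithmetic expressions):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# The expression string is evaluated with the DataFrame's columns as the
# innermost namespace, so bare names resolve to columns -- no X needed.
namespace = {col: df[col] for col in df.columns}
eval("a * 2 + b", {"__builtins__": {}}, namespace)  # -> Series [12, 24, 36]
```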
The `eval` option is definitely intriguing, assuming there aren't unavoidable performance or security implications. Maybe it could be an option, alongside functions/lambdas and Laters?
So a mutate could be done one of three ways:
mutate(twoA=X.a*2)
mutate(twoA="a*2")
mutate(twoA=lambda x: x.a*2)
Column-name string arguments to arrange, filter, etc. would simply fall out of that for free (though they could also be optimized if necessary).
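A toy version of the dispatch (a standalone sketch, not dplython internals; a real implementation would also need to check for dplython's Later type before the callable branch, since Laters themselves support calls):

```python
import pandas as pd

def mutate_any(df, **kwargs):
    # Hypothetical mutate() that accepts expression strings and callables
    # in addition to precomputed values.
    out = df.copy()
    for name, arg in kwargs.items():
        if isinstance(arg, str):       # mutate(twoA="a*2")
            cols = {c: out[c] for c in out.columns}
            out[name] = eval(arg, {"__builtins__": {}}, cols)
        elif callable(arg):            # mutate(twoA=lambda x: x.a*2)
            out[name] = arg(out)
        else:                          # plain values broadcast as usual
            out[name] = arg
    return out

df = pd.DataFrame({"a": [1, 2, 3]})
mutate_any(df, twoA="a*2", threeA=lambda x: x.a * 3)
```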
Thinking about this more, I'm worried about doing this. The main reason is that currently,
df >> mutate(twoA="a*2")
returns a dataframe with a column named `twoA`, filled with the string `"a*2"`.
Being able to add strings into your DataFrame is clearly an important feature that we need to keep. What are ways to resolve this ambiguity? None of these sound good.
- Put a function outside the string which tells dplython that the string should be evaluated, e.g. `mutate(twoA=StrEval("a*2"))` (see the sketch after this list)
- Put some format inside the string that tells dplython what to do, e.g. `mutate(twoA="EVAL: a*2")`
- Disallow the behavior of assigning strings to columns.
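For what it's worth, the first option is cheap to prototype; `StrEval` here is just a hypothetical marker class, not anything dplython provides:

```python
class StrEval(object):
    """Marks an expression string that should be eval'd against the
    DataFrame instead of being stored as a literal string column."""
    def __init__(self, expr):
        self.expr = expr

# Inside mutate(), only wrapped strings would be evaluated:
#     if isinstance(arg, StrEval):
#         value = eval(arg.expr, {"__builtins__": {}},
#                      {c: df[c] for c in df.columns})
#     else:
#         value = arg   # plain strings keep today's broadcast behavior
```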
Okay, so that's just mutate. What about the other verbs? `summarize` has the same problem as mutate. `arrange`, `dfilter`, and `group_by` are more arguable. But now we're adding a behavior in a few functions that intuitively should work in all functions. A user can arrange by `"carat"` but can't mutate using `"carat"`.
In summary, my main concerns are:
- Adds complexity without a great benefit
- Adds inconsistency: users can refer to column names as strings in arrange but can't in mutate or summarize. This seems like the bigger problem.
If the solution is `eval` based then there are two options:
- Strings are strings, so play with the quotes, i.e. `mutate(newcol="'string'")`
- Define an identity function and make it a part of the execution environment, so that `mutate(newcol="I(string)")` works.
def I(value):
    return value
The first one is free; the second can be added as well.
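Concretely, both spellings would come out of the same eval environment; this is just an illustration of the idea with a throwaway DataFrame:

```python
import pandas as pd

def I(value):
    return value

df = pd.DataFrame({"carat": [0.2, 0.3]})
namespace = {col: df[col] for col in df.columns}
namespace["I"] = I  # the identity helper becomes part of the execution environment

eval("'some text'", {"__builtins__": {}}, namespace)     # option 1: play with the quotes
eval("I('some text')", {"__builtins__": {}}, namespace)  # option 2: wrap it in I(...)
```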
While the first option would work and would be unambiguous if all queries were strings that were eval'd, I think we would still want to support a non-eval option. So we need some way to distinguish between strings meaning "column with this string repeated" and strings meaning "column whose name matches this string"/"column with expression to be eval'd".
This is tricky. How common, though, is the use case of mutating to add a column with just a single string repeated? I don't think there'd be any other ambiguous cases. Why do you think summarize has the same problem?
Maybe just say that string arguments to mutate are interpreted as column names or expressions rather than vectorized strings, and force people who want to make a "foo" column filled with "bar" to wrap it using something like `mutate(foo=repeat("bar"))`.
I don't know what the "right" style is, but in some of my dplyr code I definitely add columns that are one string. Sometimes I'll have a function output one DataFrame, generate a few DataFrames, add a str column to each one, and append them together.
My thinking has shifted to this: writing `X.carat` instead of `"carat"` is the same number of keystrokes -- the same amount of effort. Sure, it looks a little strange and feels unusual at first. But dplython uses the Later everywhere, so if you're using dplython you should use Laters. I like there being One Way To Do Things, especially if supporting multiple ways of doing things means adding in more functions or more complexity.
Fair enough, although I'll note that the string evaluation method would save keystrokes on more complex expressions like `mutate(cost="price / carat")`.
To fit with dplyr, these should be in the form `select_`, `mutate_`, `group_by_`, etc. For example:
diamonds >> select(X.carat)
vs
diamonds >> select_("carat")
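If that route is taken, the underscore variants could presumably be thin wrappers over the existing verbs; a sketch (the `select_` name mirrors dplyr, nothing like it exists in dplython yet):

```python
from dplython import X, select

def select_(*names):
    # Turn each column-name string into the equivalent X attribute lookup
    # and delegate to the existing Later-based select().
    return select(*[getattr(X, name) for name in names])

# diamonds >> select_("carat", "price")  is then the same as
# diamonds >> select(X.carat, X.price)
```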