pandas2
Add a better pipe functionality by using an "unused" operator
R's "new" pipes, combined with easily added functions, have made data handling in R much easier to read and to extend than in pandas. The advantage is IMO twofold:
- using pipes (or . notation in Python) is much easier to read than nested function calls: df %>% func(...) %>% func2(...) and df.func(...).func2(...) vs. func2(func(df, ...), ...)
- using functions as the base makes for easy extensibility (you would need monkey patching to add new functionality to a pandas df)
Pandas nowadays has a df.pipe() method, but that looks much clumsier compared to the elegance of a separate pipe operator.
So I would like to see pandas2 reserve one of the rarely needed operators (e.g. >>?) for a pipe interface. The pipe interface would let users define new functions that return a small object, which the >> operator would then consume (probably doable with a decorator around the function).
Since this can't be done in pandas itself because it is an API-breaking change, I would like to propose that it be done in pandas2.
How do you envision this working from a user perspective, given that we don't have the code-rewriting benefits of non-standard evaluation (that R has)? You could of course require users to write their pipe functions like:
def my_pipe_func(*args, **kwargs):
    def pipe_impl(df, *args, **kwargs):
        # omitted
        pass
    return pipe_impl
so that then you could write
df >> my_pipe_func(a, b, c)
instead of the current
df.pipe(pipe_impl, a, b, c)
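A minimal runnable sketch of the closure approach above, assuming a tiny stand-in `Table` class instead of a real pandas DataFrame. `Table`, `add_column`, and `add_column_impl` are hypothetical names used only for illustration. It suggests that the `>>` form and the `.pipe()` form can produce the same result:

```python
class Table:
    """Hypothetical stand-in for a DataFrame: a dict of column name -> list."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def pipe(self, func, *args, **kwargs):
        # Same calling convention as DataFrame.pipe: func(self, *args, **kwargs).
        return func(self, *args, **kwargs)

    def __rshift__(self, other):
        # `table >> verb` calls the verb with the table as its only argument.
        if not callable(other):
            return NotImplemented
        return other(self)


def add_column_impl(table, name, value):
    """The actual work: return a new Table with one extra constant column."""
    n_rows = len(next(iter(table.columns.values()), []))
    new_columns = dict(table.columns)
    new_columns[name] = [value] * n_rows
    return Table(new_columns)


def add_column(*args, **kwargs):
    """Pipe-friendly wrapper: capture the arguments now, apply to a table later."""
    def pipe_impl(table):
        return add_column_impl(table, *args, **kwargs)
    return pipe_impl


t = Table({"a": [1, 2, 3]})
via_operator = t >> add_column("b", 0)
via_pipe = t.pipe(add_column_impl, "b", 0)
print(via_operator.columns == via_pipe.columns)
```

The cost of the operator form is the extra wrapper per verb, which is the part a decorator could generate.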
Yes, something like that:
class DataFrame():
    [...]
    def __rshift__(self, other):
        assert isinstance(other, PipeVerb), "Right shift is used for piping, use x.rshift(other) instead"
        return other.__rrshift__(self)

class PipeVerb():
    def __init__(self, func, *args, **kwargs):
        self.pipe_func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, input):
        # called for `input >> verb`: apply the stored function to the left operand
        return self.pipe_func(input, *self.args, **self.kwargs)

def my_verb(x=1, y=2):
    def my_verb_impl(df, x=1, y=2):
        # do something and return something
        pass
    pipe = PipeVerb(my_verb_impl, x=x, y=y)
    return pipe
# The above could be implemented in a decorator which also supports single dispatch:
from functools import singledispatch  # in py >=3.4 and backported
import pandas as pd

def pipe_verb(func):
    func = singledispatch(func)
    def decorated(*args, **kwargs):
        return PipeVerb(func, *args, **kwargs)
    decorated.register = func.register
    return decorated

@pipe_verb
def my_verb_impl(input, x=1, y=2):
    raise NotImplementedError(...)

@my_verb_impl.register(pd.DataFrame)
def my_verb_impl_df(input, x=1, y=2):
    # do something with input being a DataFrame
    pass

@my_verb_impl.register(pd.core.groupby.DataFrameGroupBy)
def my_verb_impl_gb(input, x=1, y=2):
    # do something with input being a grouped DataFrame
    pass
simple example:
@pipe_verb
def doit(content):
    print(type(content))
    return content

@doit.register(int)
def doit_int(content):
    print("INT")
    return float(content)

1 >> doit() >> doit()
# prints:
# INT
# <class 'float'>
You still wouldn't get things like df >> my_pipe(x=columnname), but at least df >> my_pipe(x="columnname") would be possible, and with a helper symbol like the one from dplython or pandas-ply: df >> my_pipe(x=X.columnname) >> select(X[X.column1:X.column2])
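To make the helper-symbol idea concrete, here is a rough sketch of how such an `X` could defer column lookup until a table is piped in. This is not how dplython or pandas-ply implement it. `ColumnRef`, `_X`, and `total` are hypothetical names, and a plain dict stands in for a DataFrame:

```python
class ColumnRef:
    """Hypothetical deferred reference to a column, resolved later against a table."""

    def __init__(self, name):
        self.name = name

    def resolve(self, table):
        return table[self.name]


class _X:
    """`X.some_column` builds a ColumnRef instead of looking anything up."""

    def __getattr__(self, name):
        return ColumnRef(name)


X = _X()


class PipeVerb:
    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def __rrshift__(self, table):
        # Called for `table >> verb` when the left operand does not handle `>>`.
        return self.func(table, *self.args, **self.kwargs)


def total(x):
    """Hypothetical verb that sums one column; `x` is a ColumnRef such as X.price."""
    def impl(table, x):
        return sum(x.resolve(table))
    return PipeVerb(impl, x)


table = {"price": [1.5, 2.5], "qty": [3, 4]}  # a dict stands in for a DataFrame
print(table >> total(X.price))  # prints 4.0
```

Slices such as X[X.column1:X.column2] would need an additional `__getitem__` on the symbol plus a resolver like the `select_vars` helper mentioned in this thread. That part is left out here.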
The only thing pandas needs to implement is the __rshift__(self, other) method on all objects that should be pipeable. pandas might also export such an X symbol and common helpers like select_vars (from dplyr), which converts X[X.column1:X.column2] to the real column names. But that is only needed so that other projects can use this interface to build new "verbs" that all behave the same way. This could also be done in a different project.
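For reference, a small sketch of how Python resolves `>>` between a left operand and a verb object, using hypothetical stand-in classes (`Plain`, `Pipeable`) rather than pandas types. It makes no claim about which case current pandas objects fall into. A left operand without `__rshift__` falls through to the verb's `__rrshift__`, and a left operand that defines `__rshift__` can return `NotImplemented` for non-verbs so that the usual `TypeError` is raised:

```python
class PipeVerb:
    def __init__(self, func):
        self.func = func

    def __rrshift__(self, left):
        # Reflected operator: tried when the left operand cannot handle `>>`.
        return self.func(left)


class Plain:
    """Hypothetical class that defines no __rshift__ at all."""


class Pipeable:
    """Hypothetical class that defines __rshift__ and delegates only to verbs."""

    def __rshift__(self, other):
        if isinstance(other, PipeVerb):
            return other.__rrshift__(self)
        return NotImplemented  # lets Python raise the usual TypeError


describe = PipeVerb(lambda obj: type(obj).__name__)

print(Plain() >> describe)     # prints Plain (via PipeVerb.__rrshift__)
print(Pipeable() >> describe)  # prints Pipeable (via Pipeable.__rshift__)

try:
    Pipeable() >> 3
    raised = False
except TypeError:
    raised = True
print(raised)  # prints True
```

Returning `NotImplemented` instead of asserting keeps the error for non-verb operands a regular `TypeError`, and it still works when assertions are disabled with `python -O`.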
CC: @joshuahhh / @dodger487 for the dplyr-like interfaces mentioned above
here's a dplython extension https://github.com/kieferk/dfply/blob/master/README.md
which I suppose is familiar to R users but is completely unfamiliar to Python users
It was unknown in the R world a few years back (the first commit in the magrittr repo which implements the pipe operator %>% is from 1.1.2014) and now it feels like the only right way to do data analysis :-)
some nice thoughts from the Julia folks: http://julialang.org/blog/2016/10/StructuredQueries
lots to grok in that article