pandas2

Add a better pipe functionality by using an "unused" operator

Open jankatins opened this issue 9 years ago • 5 comments

R's "new" pipes, combined with easily added functions, have made R's data handling much easier to read and to extend than pandas'. The advantage is IMO twofold:

  • using pipes (or . notation in Python) is much easier to read than nested function calls (df %>% func(...) %>% func2(...) and df.func(...).func2(...) vs. func2(func(df, ...), ...))
  • using functions as a base makes for easy extensibility (you would need monkey patching to add new functionality to a pandas df)

Pandas nowadays has a df.pipe() method, but it looks much clumsier than a separate pipe operator.

So I would like to see pandas2 reserve one of the rarely needed operators (e.g. >>?) for a pipe interface. The pipe interface would let users define new functions that return a small object consumed by the >> operator (probably doable with a decorator around the function).

As this wasn't possible in pandas because it is an API-breaking change, I would like to propose that it be done in pandas2.

jankatins avatar Aug 24 '16 13:08 jankatins

How do you envision this working from a user perspective, given that we don't have the code-rewriting benefits of non-standard evaluation (that R has)? You could of course require users to write their pipe functions like:

def my_pipe_func(*args, **kwargs):
    def pipe_impl(df, *args, **kwargs):
        # omitted
        pass

    return pipe_impl

so that then you could write

df >> my_pipe_func(a, b, c)

instead of the current

df.pipe(pipe_impl, a, b, c)
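
For illustration, the closure pattern above could be sketched in runnable form roughly as follows. This is only a minimal sketch under assumptions: `Frame` is a hypothetical stand-in for a DataFrame, and `head` and `add` are made-up pipe functions, none of which are pandas API. The outer function captures its arguments and the inner function receives only the piped object.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame, just enough to run the sketch."""

    def __init__(self, rows):
        self.rows = list(rows)

    def pipe(self, func, *args, **kwargs):
        # Mirrors the shape of df.pipe(func, ...): call func with self first.
        return func(self, *args, **kwargs)

    def __rshift__(self, step):
        # df >> step is treated as step(df); anything not callable is rejected.
        if not callable(step):
            return NotImplemented
        return step(self)


def head(n):
    """Hypothetical pipe function: keep the first n rows."""
    def head_impl(df):
        # n comes from the enclosing call, df from the >> operator.
        return Frame(df.rows[:n])
    return head_impl


def add(k):
    """Hypothetical pipe function: add k to every row."""
    def add_impl(df):
        return Frame(row + k for row in df.rows)
    return add_impl


result = Frame([1, 2, 3, 4]) >> head(2) >> add(10)
print(result.rows)
```

With this shape, `Frame([1, 2, 3]).pipe(head(1))` and `Frame([1, 2, 3]) >> head(1)` do the same thing; the operator only changes how the chain reads.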

wesm avatar Aug 24 '16 20:08 wesm

Yes, something like that:

class DataFrame():
    [...]
    def __rshift__(self, other):
         assert isinstance(other, PipeVerb), "Right shift is used for piping, use x.rshift(other) instead"
         return other.__rrshift__(self)

class PipeVerb():
    def __init__(self, func, *args, **kwargs):
        self.pipe_func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, input):
        return self.pipe_func(input,  *self.args, **self.kwargs)

def my_verb(x=1, y=2):
    def my_verb_impl(df, x=1, y=2):
        # do something and return something
        pass
    # wrap the implementation so it can be used on the right-hand side of >>
    pipe = PipeVerb(my_verb_impl, x=x, y=y)
    return pipe

# The above could be implemented in a decorator which also supports single dispatch:
from functools import singledispatch  # in py >=3.4 and backported
import pandas as pd
def pipe_verb(func):
    func = singledispatch(func)
    def decorated(*args, **kwargs):
        return PipeVerb(func, *args, **kwargs)
    decorated.register = func.register
    return decorated

@pipe_verb
def my_verb_impl(input, x=1, y=2):
    raise NotImplementedError(...)

@my_verb_impl.register(pd.DataFrame)
def my_verb_impl_df(input, x=1, y=2):
    # do something with input being a Dataframe
    pass

@my_verb_impl.register(pd.core.groupby.DataFrameGroupBy)
def my_verb_impl_gb(input, x=1, y=2):
    # do something with input being a Grouped Dataframe
    pass

simple example:

@pipe_verb
def doit(content):
    print(type(content))
    return content

@doit.register(int)
def doit_int(content):
    print("INT")
    return float(content)

1 >> doit() >> doit()
# prints:
# INT
# <class 'float'>

You still wouldn't get things like df >> my_pipe(x=columnname), but at least df >> my_pipe(x="columnname") would be possible, and with a helper symbol like the one from dplython or pandas-ply: df >> my_pipe(x=X.columnname) >> select(X[X.column1:X.column2])

The only thing pandas needs to implement is the __rshift__(self, other) method on all objects which should be pipeable. pandas might also export such an X symbol and common helpers like select_vars (from dplyr), which converts X[X.column1:X.column2] to the real column names. But this is only needed so that other projects can use this interface to build new "verbs" that behave in the same way. This could also be done in a different project.
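
To make the X symbol idea a bit more concrete, here is a rough sketch of how such a symbol and a select_vars-style helper might work. Everything in it is an assumption for illustration: `ColumnRef`, `_Symbol` and this `select_vars` signature are hypothetical and are not the dplython, pandas-ply or dplyr implementations. The slice is treated as inclusive on both ends, which is how dplyr column ranges behave.

```python
class ColumnRef:
    """Hypothetical deferred reference to a column, created by X.<name>."""

    def __init__(self, name):
        self.name = name


class _Symbol:
    """Hypothetical helper symbol: records which column was asked for."""

    def __getattr__(self, name):
        # Leave dunder lookups alone so copy/pickle keep working.
        if name.startswith("__"):
            raise AttributeError(name)
        return ColumnRef(name)

    def __getitem__(self, key):
        # X[X.a:X.c] arrives here as slice(ColumnRef, ColumnRef); pass it on.
        return key


X = _Symbol()


def select_vars(columns, spec):
    """Resolve a ColumnRef, or an inclusive slice of them, to column names."""
    columns = list(columns)
    if isinstance(spec, ColumnRef):
        return [spec.name]
    if isinstance(spec, slice):
        start = 0 if spec.start is None else columns.index(spec.start.name)
        stop = len(columns) if spec.stop is None else columns.index(spec.stop.name) + 1
        return columns[start:stop]
    raise TypeError("expected X.<column> or X[X.<first>:X.<last>]")


print(select_vars(["a", "b", "c", "d"], X[X.b:X.c]))
```

A verb such as select could then call a helper like this with the real column names of whatever object was piped in, so that all verbs resolve X expressions the same way.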

CC: @joshuahhh / @dodger487 for the above dplyr-like interfaces

jankatins avatar Aug 24 '16 22:08 jankatins

here's a dplython extension https://github.com/kieferk/dfply/blob/master/README.md

which for R users i suppose is familiar but is completely non familiar to python users

jreback avatar Oct 04 '16 11:10 jreback

> which for R users i suppose is familiar but is completely non familiar to python users

It was unknown in the R world a few years back (the first commit in the magrittr repo which implements the pipe operator %>% is from 1.1.2014) and now it feels like the only right way to do data analysis :-)

jankatins avatar Oct 04 '16 16:10 jankatins

some nice thoughts from the Julia folks: http://julialang.org/blog/2016/10/StructuredQueries

lots to grok in that article

jreback avatar Oct 04 '16 17:10 jreback