siuba
Composing pipes
Related issues
- #246
Below I lay out three challenges for piping. As a precursor to them it's worth noting that there is a delicate balance between a piping strategy that is...
- cool and super {succinct or flexible} and handles a lot of theoretical cases
- less {succinct or flexible} in general, but handles common cases
Ideally a pipe should
- handle lazy (easy): f = some_pipe; f(data)
- handle eager (harder): data >> some_pipe
- have a strategy to go to method chains. eg some_pipe >> pipe(_.method1().method2())
- have a verbose reference implementation, and a potentially less verbose one (with potentially more exceptions to remember); eg some_pipe >> _.method1().method2()
(all examples for illustration, not saying that's how it should happen)
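To make the lazy and eager cases concrete, here is a minimal sketch in plain Python. The Pipeable class and everything in it are assumptions made for illustration, not siuba's actual implementation.

```python
class Pipeable:
    """Hypothetical minimal pipe: wraps a one-argument function."""

    def __init__(self, func):
        self.func = func

    def __call__(self, data):
        return self.func(data)

    def __rshift__(self, other):
        # pipe >> callable: compose lazily, left to right
        if callable(other):
            return Pipeable(lambda data: other(self.func(data)))
        return NotImplemented

    def __rrshift__(self, other):
        # callable >> pipe: compose lazily
        if callable(other):
            return Pipeable(lambda data: self.func(other(data)))
        # data >> pipe: evaluate eagerly
        return self.func(other)


double = Pipeable(lambda x: x * 2)
add_one = Pipeable(lambda x: x + 1)

lazy = double >> add_one        # nothing runs yet
lazy_result = lazy(3)           # runs both steps
eager_result = 3 >> double >> add_one
print(lazy_result, eager_result)
```

One limitation of this sketch: data that happens to be callable is treated as a function to compose with, not as a value to pipe.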
With method chains
Suppose we want to do a mutate, and then set it as the index...
from siuba.data import mtcars
from siuba import _, mutate, pipe
(
mtcars
>> mutate(res = _.hp + 2)
>> pipe(_.set_index("res"))
)
If we wanted to keep chaining data frame methods, we would have to either...
- add a pipe for every method in the chain
- add a single pipe, with the whole chain
# approach 1
(
mtcars
>> mutate(res = _.hp + 2)
>> pipe(_.set_index("res"))
>> pipe(_.assign(a = "b"))
)
# approach 2
(
mtcars
>> mutate(res = _.hp + 2)
>> pipe(_
.set_index("res")
.assign(a = "b")
)
# or pipe(_.set_index(...).assign(...))
)
This is because the dot operator has higher precedence than >>, so gets evaluated first.
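The precedence claim can be checked without running any pipe code by asking Python's own parser how it groups such an expression. The names data, pipe, step, method1 and method2 below are placeholders that are only parsed, never evaluated.

```python
import ast

# Parse without evaluating; the names are placeholders.
tree = ast.parse("data >> pipe(step).method1().method2()", mode="eval")
shift = tree.body

# The top-level operation is >> ...
top_level = (type(shift).__name__, type(shift.op).__name__)

# ... and its right operand is the whole method chain, whose outermost
# call is .method2(), so the chain is built before >> ever runs.
outermost_call = shift.right.func.attr

print(top_level, outermost_call)
```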
One solution could be adding a syntax where method chaining off a pipe, like pipe(...).method1().method2() produced a pipe. But I don't think we need more getattr magic.
A final option is presented in the "Simplifying piping to a Symbol or Call" section: make a Symbol's default behavior for >> produce a Pipeable.
Eager piping with two starting non-pipe funcs
The setup below works for a lazy pipe, but not for an eager one.
# lazy case works
f = "".join >> pipe(_.upper())
f(['a', 'b'])
# eager case raises error
['a', 'b'] >> "".join >> pipe(_.upper())
This is because...
- >> is left associative
- the eager case has two non-pipe objects on the left, so ['a', 'b'] >> "".join is evaluated first, and neither operand knows how to pipe.
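Both points can be checked with the standard library alone. This small sketch first confirms how the parser groups a chain of >>, then shows the first step of the eager case failing on its own.

```python
import ast

# a >> b >> c is grouped as (a >> b) >> c: the left operand is itself a >>.
tree = ast.parse("a >> b >> c", mode="eval")
left_is_shift = isinstance(tree.body.left, ast.BinOp)

# So the eager case starts with list >> builtin method, and neither
# type defines >> (or its reflected version).
try:
    ["a", "b"] >> "".join
    eager_failed = False
except TypeError:
    eager_failed = True

print(left_is_shift, eager_failed)
```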
If the reverse case << were supported, it would be fine...
pipe(_.upper()) << "".join << ['a', 'b']
But using both approaches would create a dumb decision for users.
Another workaround would be putting the lazy pipe part in parentheses...
['a', 'b'] >> ("".join >> pipe(_.upper()))
But this removes an advantage of the normal eager pipe, which is that it evaluates line by line. That is...
# very explicit, reference implementation
(
['a', 'b']
>> pipe("".join) # runs in python first
>> pipe(_.uppsldkfjer()) # runs in python second
>> pipe(_.upper()) # error above before evaluating this line
)
Simplifying piping to a Symbol or Call
If we wanted to declutter piping, we could change the way >> worked on a symbol to go from...
"".join >> pipe(_.upper())
to
"".join >> _.upper()
This would add an extra caveat to siu expressions _, which right now basically have very few rules to learn.
References
@machow, have you taken a look at the sspipe module? It is a good project and might serve you as a reference
Ah, I hadn't--thanks, this looks perfect! I think before I was hesitant to bake the pipe behavior into _, but seeing those examples, it def seems worth it :o.
I see sspipe and siuba as groundbreaking packages and I believe the combination of both can be super powerful, so I'm glad it was helpful, @machow
What about not overloading an operator at all, but using a pipe function?
There's already one in the functoolz package: pipe
For instance:
from siuba.data import mtcars
from siuba import select
from toolz.functoolz import pipe
pipe(
mtcars,
select("mpg", "cyl", "disp"),
lambda _: _.columns.values,
)
I like this in particular because it doesn't require everything to be a pipeable. For instance I can directly pipe into the lambda without using siuba's pipe verb. I don't find it less explicit or harder to read, and it doesn't "abuse" the >> operator.
EDIT: the examples from above work, too:
pipe(
mtcars,
mutate(res=_.hp + 2),
_.set_index("res"),
_.assign(a="b"),
)
pipe(
["a", "b"],
"".join,
_.upper(),
)
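For readers without toolz installed, the idea behind a pipe function fits in a few lines of standard library code. This is only a sketch of the same left-to-right threading; pipe_value is a made-up name chosen to avoid confusion with toolz's pipe and siuba's pipe.

```python
from functools import reduce


def pipe_value(data, *funcs):
    """Thread data through funcs from left to right."""
    return reduce(lambda acc, func: func(acc), funcs, data)


result = pipe_value(["a", "b"], "".join, str.upper)
print(result)
```

Any one-argument callable works as a step here, which is the property highlighted above: nothing needs to be wrapped in a pipeable first.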
Hmm.. yeah--so the naming may be unfortunate, since siuba's pipe is meant to be analogous to the DataFrame.pipe method, but I think your suggestion makes sense.
I'm thinking about for now adding a function, call, to do what the pandas' pipe method does, while also allowing chaining with >>. So to begin with...
from siuba import *
from siuba.data import mtcars
def some_func(a, data):
print(a)
return data
# using new call function
mtcars >> call(some_func, 1, data=_)
# equivalent to
pipe(mtcars, lambda _: some_func(1, data=_))
# longwinded support for method chaining / whatnot
mtcars >> call(_[_.gear < 4])
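A rough sketch of how such a call could behave, in plain Python. Everything here is an assumption for illustration: the DATA marker stands in for _, and the fallback of passing the data as the first argument mirrors what DataFrame.pipe does, not necessarily what siuba will do.

```python
class Placeholder:
    """Hypothetical marker for where the piped data should go."""


DATA = Placeholder()


class call:
    """Hypothetical sketch: wrap a function so data can be piped in with >>."""

    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def __call__(self, data):
        # Substitute the data wherever the marker appears
        args = [data if a is DATA else a for a in self.args]
        kwargs = {k: data if v is DATA else v for k, v in self.kwargs.items()}
        values = list(self.args) + list(self.kwargs.values())
        if not any(v is DATA for v in values):
            args = [data] + args  # no marker: data goes first
        return self.func(*args, **kwargs)

    def __rrshift__(self, data):
        # data >> call(...): evaluate eagerly
        return self(data)


def some_func(a, data):
    return [a] + data


with_marker = [2, 3] >> call(some_func, 1, data=DATA)
data_first = [3, 1, 2] >> call(sorted)
print(with_marker, data_first)
```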
Then, once I finish a big refactor of siuba's internals (that's almost done?! 😬), I think it will be easier to add in the sspipe like behavior...
# with sspipe behavior
(mtcars
>> _[_.gear < 4]
>> call(some_func, 1, data=_)
)
# with pipe
pipe(
mtcars,
_[_.gear < 4],
lambda d: some_func(1, data=d),
)
I think that intuitively, there's something that feels simpler to me about reading code with the overloaded operator. It seems like knowing that things execute piece by piece (since the pipe w/ >> is all binary operations) feels simpler than an outer function (even if it's a dead simple function). But there are also a lot of places a pipe function would be useful (and I could be totally wrong with preferring an operator ;).
It seems like knowing that things execute piece by piece (since the pipe w/ >> is all binary operations) feels simpler than an outer function (even if it's a dead simple function).
pipe() can be useful indeed, but I agree with the above comment. Also, the statements in pipe() are separated by commas, which is visually confusing.
It seems like knowing that things execute piece by piece (since the pipe w/ >> is all binary operations) feels simpler than an outer function (even if it's a dead simple function).
I know what you mean, I'm just wondering if a pure python developer who has never seen dplyr in action would not find the overloaded >> more confusing.
In any case, your proposal from above sounds great! It also seems that the sspipe behaviour will "just work" with the functoolz.pipe function without additional effort. I also like the idea of renaming siuba.pipe to siuba.call. Even if it does not match the pandas API, I find it has the clearer semantics.
btw, the macropy solution here also looks intriguing. but probably you've already seen it at some point.