siuba
Implement rowwise
I'm pretty sure rowwise -> mutate is one of the most important patterns siuba could implement. This is because, unlike in R, Python functions are rarely vectorized. This means that doing operations on a single DataFrame row requires a very cumbersome apply syntax:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
# first row gives "a is 1, and b is 4"
df.apply(lambda d: "a is {a}, and b is {b}".format(**d), axis = 1)
Note that in the example above, apply is receiving a Series, so we can use **kwargs syntax.
However, if we were to take a manual approach to rowwise, by using...
group_by(tmp = row_number(_)) >> mutate(...)
Then each argument of the mutate would receive a single row DataFrame. This is much harder to work with.
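To make the difference concrete, here is a rough sketch using plain pandas only (no siuba calls). The per-row grouping key `row_ids` is my own stand-in for the `row_number(_)` trick above, so treat it as an illustration rather than what siuba does internally.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Row-wise apply: each call receives a Series keyed by column name,
# so ** unpacking works directly.
via_apply = df.apply(lambda row: "a is {a}, and b is {b}".format(**row), axis=1)

# One group per row (stand-in for grouping on a row number):
# each group is a single-row DataFrame, not a Series.
row_ids = pd.Series(range(len(df)), index=df.index)
pieces = [group for _, group in df.groupby(row_ids)]

# Scalars now have to be pulled out of each one-row frame by hand.
manual = [
    "a is {}, and b is {}".format(g["a"].iloc[0], g["b"].iloc[0])
    for g in pieces
]
```

Both approaches produce the same strings, but the grouped version needs an explicit `.iloc[0]` for every column it touches.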
Proposal
- rowwise creates a special subclass of groupby
- mutate handles this subclass by passing each operation the equivalent of the Series in the first example (may be a custom class closer to a dict, with lower overhead to init)
- this means rowwise -> mutate is similar to doing df["new_col"] = df.apply(..., axis = 1)
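As a rough sketch of the behaviour proposed here, the helper below calls each function once per row with a plain dict. The name `rowwise_mutate` and its signature are hypothetical, invented for illustration; this is not siuba API and says nothing about how the groupby subclass itself would be built.

```python
import pandas as pd


def rowwise_mutate(df, **new_cols):
    """Hypothetical helper (not siuba API): call each function once per row.

    Each function receives a plain dict of {column name: value}, which is
    cheaper to build than a Series and still supports ** unpacking.
    """
    out = df.copy()
    # Rows are taken from the original frame, so a new column cannot
    # refer to another column created in the same call.
    rows = df.to_dict(orient="records")
    for name, func in new_cols.items():
        out[name] = [func(row) for row in rows]
    return out


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
result = rowwise_mutate(
    df, label=lambda row: "a is {a}, and b is {b}".format(**row)
)
```

Using dicts instead of Series is one possible reading of the "closer to a dict" idea above; whether it fits with fast grouped operations is a separate question.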
I should think a bit more about how this fits with fast grouped operations.
This is also one of the reasons why datar implements its own DataFrameGroupBy, which is actually a subclass of pandas' DataFrame.
The idea behind it is borrowed from dplyr. The grouped_df or rowwise_df classes are actually a data frame/tibble with group_data, group_vars and group_drop attributes, indicating the grouping data/rows, the grouping columns, and whether unobserved values should be dropped while grouping, respectively.
For a "rowwise" data frame, we just assign the row numbers to group_data. While doing operations like mutate, a single-row data frame is pulled each time. This makes it easier to align with the way operations are implemented on grouped data frames but, of course, sacrifices some performance.
Just some ideas here. You may also take a look at dplyr's source code for rowwise_df and see if there is a better way to implement this.
One more thing about rowwise data frames that I noticed with dplyr: extra variables are allowed to be passed to rowwise(...). These columns/variables will then become grouping variables when the data frame turns into a grouped_df (i.e. after summarise).
I'm wondering whether it would also make sense to add a convenience apply function to avoid lambda + kwarg hackery.
Use would be something like:
df >> mutate(new_col=apply(non_vec_func, arg1=_.x, arg2=_.y, other_arg=_.z))
or perhaps force explicit rowwise:
df >> rowwise() >> mutate(new_col=apply(non_vec_func, arg1=_.x, arg2=_.y, other_arg=_.z))
A nice advantage of this is that it makes it easier to have different names for the columns being passed versus the input arg names expected by the function.
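Here is a rough sketch of what such a helper might do underneath, written against plain pandas. `apply_rows` is a hypothetical name, and it takes already-evaluated Series rather than siuba's lazy `_.x` expressions, so the siuba integration is left out entirely; `non_vec_func` is a toy function made up for the example.

```python
import pandas as pd


def apply_rows(func, **columns):
    """Hypothetical helper: call func once per row.

    Keyword names are the argument names func expects; values are the
    columns (Series) that supply them, so the two can differ.
    """
    names = list(columns)
    frame = pd.DataFrame(columns)  # aligns the Series on their shared index
    values = [
        func(**dict(zip(names, row)))
        for row in frame.itertuples(index=False, name=None)
    ]
    return pd.Series(values, index=frame.index)


def non_vec_func(arg1, arg2, other_arg=0):
    # Toy non-vectorized function: works on scalars only.
    return max(arg1, arg2) - other_arg


df = pd.DataFrame({"x": [1, 5, 3], "y": [4, 2, 6], "z": [1, 1, 1]})
df = df.assign(
    new_col=apply_rows(non_vec_func, arg1=df["x"], arg2=df["y"], other_arg=df["z"])
)
```

The mapping from `x`, `y`, `z` to `arg1`, `arg2`, `other_arg` happens in the call itself, with no lambda needed.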