
Implement rowwise

Open machow opened this issue 3 years ago • 3 comments

I'm pretty sure rowwise -> mutate is one of the most important patterns siuba could implement. This is because, unlike in R, Python functions are rarely vectorized. This means that doing operations on a single DataFrame row requires a very cumbersome apply syntax:

import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

# produces e.g. "a is 1, and b is 4" for the first row
df.apply(lambda d: "a is {a}, and b is {b}".format(**d), axis = 1)

Note that in the example above, apply receives each row as a Series, so we can unpack it with the ** syntax.

However if we were to take a manual approach to rowwise, by using...

group_by(tmp = row_number(_)) >> mutate(...)

Then each argument of the mutate would receive a single-row DataFrame. This is much harder to work with.
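To make the difference concrete, here is a minimal sketch in plain pandas (not siuba) of what the manual approach hands to each operation. The `row_key` name is only for illustration; it stands in for the temporary row-number grouping key.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Group on a temporary row-number key, so every group holds exactly one row.
row_key = pd.Series(range(len(df)), index=df.index)
pieces = [piece for _, piece in df.groupby(row_key)]

first = pieces[0]

# Each piece is a one-row DataFrame, so column access gives a length-1 Series
# rather than a scalar, and the values have to be pulled out by hand.
label = "a is {}, and b is {}".format(first['a'].iloc[0], first['b'].iloc[0])
```

Compared with the apply version, the `**d` unpacking no longer works directly, because `first` maps column names to Series instead of scalars.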

Proposal

  • rowwise creates a special subclass of groupby
  • mutate handles this subclass by passing each operation the equivalent of the Series in this first example (may be a custom class closer to a dict, with lower overhead to init)
  • this means rowwise -> mutate is similar to doing df["new_col"] = df.apply(..., axis = 1)

I should think a bit more about how this fits with fast grouped operations.
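As a rough sketch of the behavior proposed above, and not siuba's actual implementation, the following uses plain pandas. The `rowwise_mutate` helper is hypothetical; it passes each function a plain dict per row, which is one possible reading of "a custom class closer to a dict".

```python
import pandas as pd

def rowwise_mutate(df, **new_cols):
    """Hypothetical helper: call each function once per row with a plain dict."""
    # One {column: value} dict per row; cheaper to build than a Series per row.
    rows = df.to_dict("records")
    out = df.copy()
    for name, func in new_cols.items():
        out[name] = [func(row) for row in rows]
    return out

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
res = rowwise_mutate(df, label=lambda d: "a is {a}, and b is {b}".format(**d))
```

Here every function sees the original columns only. Whether later expressions should also see columns created earlier in the same mutate is left open in this sketch.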

machow avatar Apr 06 '21 21:04 machow

This is also one of the reasons why datar implements its own DataFrameGroupBy, which is actually a subclass of pandas' DataFrame.

The idea behind it is borrowed from dplyr. The grouped_df and rowwise_df classes are actually data frames/tibbles with group_data, group_vars and group_drop attributes, indicating the grouping data/rows, the grouping columns, and whether unobserved values should be dropped while grouping, respectively.

For a "rowwise" data frame, we just assign the row numbers to group_data. During operations like mutate, a single-row data frame is pulled each time. This makes it easier to align with the way operations are implemented on grouped data frames but, of course, sacrifices some performance.

Just some ideas here. You may also take a look at dplyr's source code about rowwise_df and see if there is a better way to implement this.

pwwang avatar Jun 23 '21 22:06 pwwang

One more thing I noticed about rowwise data frames in dplyr: extra variables can be passed to rowwise(...). These columns/variables then become the grouping variables when the data frame turns into a grouped_df (e.g. after summarise).

pwwang avatar Jun 23 '21 23:06 pwwang

I'm wondering whether it would also make sense to add a convenience apply function to avoid lambda + kwarg hackery.

Use would be something like:

df >> mutate(new_col=apply(non_vec_func, arg1=_.x, arg2=_.y, other_arg=_.z))

or perhaps force explicit rowwise:

df >> rowwise() >> mutate(new_col=apply(non_vec_func, arg1=_.x, arg2=_.y, other_arg=_.z))

A nice advantage of this is that it makes it easier to have different names for the columns being passed versus the input arg names expected by the function.
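A minimal sketch of that idea in plain pandas is below. It is an assumption-laden illustration, not a siuba API: the `row_apply` helper and the `describe` function are hypothetical, and columns are named with strings rather than `_.x` expressions.

```python
import pandas as pd

def row_apply(df, func, **arg_to_col):
    """Hypothetical helper: call func once per row, mapping argument names to column names."""
    columns = [df[col] for col in arg_to_col.values()]
    return pd.Series(
        # Pair each argument name with that row's value from the mapped column.
        [func(**dict(zip(arg_to_col, values))) for values in zip(*columns)],
        index=df.index,
    )

def describe(width, height):
    # A non-vectorized function whose argument names differ from the column names.
    return "{} by {}".format(width, height)

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
df['new_col'] = row_apply(df, describe, width='x', height='y')
```

The keyword mapping is what decouples the column names (`x`, `y`) from the argument names the function expects (`width`, `height`).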

nathanjmcdougall avatar Jan 09 '23 20:01 nathanjmcdougall