Description

Addresses Grouped* and Hierarchical, following up to the comment I added in the thread. I will paste it for easier visibility:

I am having a hard time to think of another way of implementing meta estimators that use pandas just for it's grouping functionalities. Namely HierarchicalPredictors and Grouped* are not necessarily taking a dataframe input, but use pandas groupby's to iterate over different batch of data.

Would this be possible using pure numpy so that we don't need the pandas dependency? yes

Would this be significantly more inconvenient to maintain? absolutely yes

My opinion/suggestion is to leave them as they are in their logic.

If anything, we can allow arbitrary dataframe types as input, and use Narwhals to convert them to pandas and let the rest flow as is.

May 11 '24 20:05 FBruzzesi

🤔 are you sure about adding pyarrow as required? I think we should be able to solve this one

but use pandas groupby's to iterate over different batch of data.

Narwhals' DataFrame.GroupBy also lets you iterate over groups - I'll take a closer look at what's required, but this seems feasible

May 11 '24 20:05 MarcoGorelli

@MarcoGorelli oh I see what you mean! Regarding hierarchical I am 100% sure it is doable, I can do it tomorrow myself (Edit: ok maybe let's say 99 as I need some missing features)

For Grouped, internals are a bit messy, but I will give it a try.

May 11 '24 20:05 FBruzzesi

I am really pushing the boundaries for GroupedPredictor, I think within another swipe I can finish this.

Current status is:

GroupedTransformer passes all the tests, except one in which the group index is provided as negative (honestly, who would do that 😂)
HierarchicalPredictor is failing only one test as well (tests/test_meta/test_hierarchical_predictor.py::test_expected_output), but only for the polars case, which could be due to the fact that the estimator cannot deal with polars?
GroupedPredictor needs some more patches to close the deal.

May 12 '24 21:05 FBruzzesi

Fixed all the issues I had in the conversion and marking this as ready to review. I am quite happy with the results 🙌🏼🙌🏼

The only breaking change from previous version/implementation is the unfeasibility of providing negative indices in groups.

cc: @koaning @MarcoGorelli

May 19 '24 13:05 FBruzzesi

Wow, amazing!

The only breaking change from previous version/implementation is the unfeasibility of providing negative indices in groups.

This is only when starting from non-dataframe input (e.g. numpy) right? If so, then would a patch like

diff --git a/sklego/meta/_grouped_utils.py b/sklego/meta/_grouped_utils.py
index 97180ab..e744f5a 100644
--- a/sklego/meta/_grouped_utils.py
+++ b/sklego/meta/_grouped_utils.py
@@ -20,7 +20,6 @@ def parse_X_y(X, y, groups, check_X=True, **kwargs) -> nw.DataFrame:
     _data_format_checks(X)
 
     # Convert X to Narwhals frame
-    X = nw.from_native(X, strict=False, eager_only=True)
     if not isinstance(X, nw.DataFrame):
         X = nw.from_native(pd.DataFrame(X))
 
diff --git a/sklego/meta/grouped_transformer.py b/sklego/meta/grouped_transformer.py
index 9f6b0d9..9c2eac3 100644
--- a/sklego/meta/grouped_transformer.py
+++ b/sklego/meta/grouped_transformer.py
@@ -109,6 +109,9 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
         self.fallback_ = None
         self.groups_ = as_list(self.groups) if self.groups is not None else None
 
+        X = nw.from_native(X, strict=False, eager_only=True)
+        if not isinstance(X, nw.DataFrame) and self.groups_ is not None:
+            self.groups_ = [X.shape[1]+group if isinstance(group, int) and group<0 else group for group in self.groups_]
         frame = parse_X_y(X, y, self.groups_, check_X=self.check_X, **self._check_kwargs)
 
         if self.groups is None:
@@ -181,6 +184,7 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
         """
         check_is_fitted(self, ["fallback_", "transformers_"])
 
+        X = nw.from_native(X, strict=False, eager_only=True)
         frame = parse_X_y(X, y=None, groups=self.groups_, check_X=self.check_X, **self._check_kwargs).drop(
             "__sklego_target__"
         )
diff --git a/tests/test_meta/test_grouped_transformer.py b/tests/test_meta/test_grouped_transformer.py
index b9aae2c..18b28d5 100644
--- a/tests/test_meta/test_grouped_transformer.py
+++ b/tests/test_meta/test_grouped_transformer.py
@@ -291,7 +291,7 @@ def test_array_with_multiple_string_cols(penguins):
     X = penguins
 
     # BROKEN: Failing due to negative indexing... kind of an edge case
-    meta = GroupedTransformer(StandardScaler(), groups=[0, X.shape[1] - 1])
+    meta = GroupedTransformer(StandardScaler(), groups=[0, -1])
 
     transformed = meta.fit_transform(X)

allow you to preserve current behaviour?

May 19 '24 15:05 MarcoGorelli

That's a nice trick! Thanks Marco! Pushing it now

May 19 '24 17:05 FBruzzesi

This should be ready for review. Once we merge into narwhals-development then I would say that we covered large majority of #658 and it would be time for a (quite large) release. @koaning WDYT?

May 23 '24 13:05 FBruzzesi

scikit-lego
scikit-lego copied to clipboard

feat: Make grouped and hierarchical dataframe-agnostic

Description

scikit-lego scikit-lego copied to clipboard

feat: Make grouped and hierarchical dataframe-agnostic

Description

scikit-lego
scikit-lego copied to clipboard