scikit-lego
scikit-lego copied to clipboard
feat: Make grouped and hierarchical dataframe-agnostic
Description
Addresses Grouped* and Hierarchical, following up to the comment I added in the thread. I will paste it for easier visibility:
I am having a hard time to think of another way of implementing meta estimators that use pandas just for it's grouping functionalities. Namely
HierarchicalPredictor
s andGrouped*
are not necessarily taking a dataframe input, but use pandas groupby's to iterate over different batch of data.
- Would this be possible using pure numpy so that we don't need the pandas dependency? yes
- Would this be significantly more inconvenient to maintain? absolutely yes
My opinion/suggestion is to leave them as they are in their logic.
If anything, we can allow arbitrary dataframe types as input, and use Narwhals to convert them to pandas and let the rest flow as is.
π€ are you sure about adding pyarrow as required? I think we should be able to solve this one
but use pandas groupby's to iterate over different batch of data.
Narwhals' DataFrame.GroupBy
also lets you iterate over groups - I'll take a closer look at what's required, but this seems feasible
@MarcoGorelli oh I see what you mean! Regarding hierarchical I am 100% sure it is doable, I can do it tomorrow myself (Edit: ok maybe let's say 99 as I need some missing features)
For Grouped, internals are a bit messy, but I will give it a try.
I am really pushing the boundaries for GroupedPredictor
, I think within another swipe I can finish this.
Current status is:
-
GroupedTransformer
passes all the tests, except one in which the group index is provided as negative (honestly, who would do that π) -
HierarchicalPredictor
is failing only one test as well (tests/test_meta/test_hierarchical_predictor.py::test_expected_output
), but only for the polars case, which could be due to the fact that the estimator cannot deal with polars? -
GroupedPredictor
needs some more patches to close the deal.
Fixed all the issues I had in the conversion and marking this as ready to review. I am quite happy with the results ππΌππΌ
The only breaking change from previous version/implementation is the unfeasibility of providing negative indices in groups
.
cc: @koaning @MarcoGorelli
Wow, amazing!
The only breaking change from previous version/implementation is the unfeasibility of providing negative indices in groups.
This is only when starting from non-dataframe input (e.g. numpy) right? If so, then would a patch like
diff --git a/sklego/meta/_grouped_utils.py b/sklego/meta/_grouped_utils.py
index 97180ab..e744f5a 100644
--- a/sklego/meta/_grouped_utils.py
+++ b/sklego/meta/_grouped_utils.py
@@ -20,7 +20,6 @@ def parse_X_y(X, y, groups, check_X=True, **kwargs) -> nw.DataFrame:
_data_format_checks(X)
# Convert X to Narwhals frame
- X = nw.from_native(X, strict=False, eager_only=True)
if not isinstance(X, nw.DataFrame):
X = nw.from_native(pd.DataFrame(X))
diff --git a/sklego/meta/grouped_transformer.py b/sklego/meta/grouped_transformer.py
index 9f6b0d9..9c2eac3 100644
--- a/sklego/meta/grouped_transformer.py
+++ b/sklego/meta/grouped_transformer.py
@@ -109,6 +109,9 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
self.fallback_ = None
self.groups_ = as_list(self.groups) if self.groups is not None else None
+ X = nw.from_native(X, strict=False, eager_only=True)
+ if not isinstance(X, nw.DataFrame) and self.groups_ is not None:
+ self.groups_ = [X.shape[1]+group if isinstance(group, int) and group<0 else group for group in self.groups_]
frame = parse_X_y(X, y, self.groups_, check_X=self.check_X, **self._check_kwargs)
if self.groups is None:
@@ -181,6 +184,7 @@ class GroupedTransformer(BaseEstimator, TransformerMixin):
"""
check_is_fitted(self, ["fallback_", "transformers_"])
+ X = nw.from_native(X, strict=False, eager_only=True)
frame = parse_X_y(X, y=None, groups=self.groups_, check_X=self.check_X, **self._check_kwargs).drop(
"__sklego_target__"
)
diff --git a/tests/test_meta/test_grouped_transformer.py b/tests/test_meta/test_grouped_transformer.py
index b9aae2c..18b28d5 100644
--- a/tests/test_meta/test_grouped_transformer.py
+++ b/tests/test_meta/test_grouped_transformer.py
@@ -291,7 +291,7 @@ def test_array_with_multiple_string_cols(penguins):
X = penguins
# BROKEN: Failing due to negative indexing... kind of an edge case
- meta = GroupedTransformer(StandardScaler(), groups=[0, X.shape[1] - 1])
+ meta = GroupedTransformer(StandardScaler(), groups=[0, -1])
transformed = meta.fit_transform(X)
allow you to preserve current behaviour?
That's a nice trick! Thanks Marco! Pushing it now
This should be ready for review. Once we merge into narwhals-development
then I would say that we covered large majority of #658 and it would be time for a (quite large) release. @koaning WDYT?