feat(hierarchical): support sample_weight
Confirmed with @FBruzzesi on the direction of the PR. This discussion took take place in #620
Description
Supporting sample_weight for HierarchicalPredictor, HierarchicalRegressor, HierarchicalClassifier.
Fixes #620
Type of change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
Checklist:
- [x] My code follows the style guidelines (ruff)
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation (also to the readme.md)
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added tests to check whether the new feature adheres to the sklearn convention
- [ ] New and existing unit tests pass locally with my changes
@FBruzzesi, this is the best I could do, have not been able to understand how to proceed. If you could take some time to assist, it would be great.
=========================================================================== short test summary info ===========================================================================
FAILED tests/test_meta/test_hierarchical_predictor.py::test_sklearn_compatible_estimator[HierarchicalRegressor(estimator=LinearRegression(),groups=0)-check_sample_weights_not_an_array] - ValueError: DataFrame constructor not properly called!
FAILED tests/test_meta/test_hierarchical_predictor.py::test_sklearn_compatible_estimator[HierarchicalRegressor(estimator=LinearRegression(),groups=0)-check_sample_weight_equivalence_on_dense_data] - ZeroDivisionError: Weights sum to zero, can't be normalized
FAILED tests/test_meta/test_hierarchical_predictor.py::test_sklearn_compatible_estimator[HierarchicalRegressor(estimator=LinearRegression(),groups=0)-check_sample_weight_equivalence_on_sparse_data] - ValueError: Estimator does not work on sparse matrices
================================================================= 3 failed, 144 passed, 10 skipped in 11.49s ==================================================================
Thanks for the contribution @AhmedThahir 🚀
@FBruzzesi, this is the best I could do, have not been able to understand how to proceed. If you could take some time to assist, it would be great.
I will take a look later today or tomorrow 😇
Alright!
Will work on the:
- suggested code changes
- additional test cases
Will get back to you once done tonight/tomorrow.
Best Regards Ahmed Thahir LinkedIn https://www.linkedin.com/in/AhmedThahir | YouTube https://www.youtube.com/channel/UCZRDn0ZVbEYEHKVmMFBI6zQ
On Wed, 26 Mar 2025 at 2:23 AM Francesco Bruzzesi @.***> wrote:
@.**** commented on this pull request.
Hey @AhmedThahir https://github.com/AhmedThahir thanks again for the contribution, this is off a great start! I think we are already quite close!
I left a few suggestions in the code. Additionally to those fixes, could you add a few test cases?
In sklego/meta/hierarchical_predictor.py https://github.com/koaning/scikit-lego/pull/737#discussion_r2013012976:
try:
self.estimator_supports_sample_weight = "sample_weight" in inspect.signature(self.estimator.fit).parametersexcept Exception:self.estimator_supports_sample_weight = FalseFor a scikit-learn compatible estimator you cannot do any operation in init method. You can assign such attribute only during fit, and it must be (semi) private, with a trailing _
In sklego/meta/hierarchical_predictor.py https://github.com/koaning/scikit-lego/pull/737#discussion_r2013014762:
@@ -281,12 +289,24 @@ def fit(self, X, y=None): if self.n_features_in_ < 1: msg = "Found 0 features, while a minimum of 1 if required." raise ValueError(msg)
self.has_sw_ = sample_weight is not Noneif self.has_sw_ and not self.estimator_supports_sample_weight:msg = f"Estimator does not support sample_weight."raise ValueError(msg)sample_weight = _check_sample_weight(sample_weight, X, None, ensure_non_negative=True)Here you probably need to cast X to native since scikit-learn does not (yet) work with narwhals objects, also dtype argument has already None as default: ⬇️ Suggested change
sample_weight = _check_sample_weight(sample_weight, X, None, ensure_non_negative=True)
sample_weight = _check_sample_weight(sample_weight, X.to_native(), ensure_non_negative=True)
In sklego/meta/hierarchical_predictor.py https://github.com/koaning/scikit-lego/pull/737#discussion_r2013015988:
_y = nw.to_native(grp_frame[self._TARGET_NAME])
return clone(self.estimator).fit(_X, _y)
args = [_X, _y]if self.estimator_supports_sample_weight and self.has_sw_:and self.has_sw_ can probably be skipped - passing all ones should behave the same as not passing any sample weight (since the estimator supports them)
— Reply to this email directly, view it on GitHub https://github.com/koaning/scikit-lego/pull/737#pullrequestreview-2715331492, or unsubscribe https://github.com/notifications/unsubscribe-auth/AWUEQREUVXLWMRPY3TUNVED2WHCMJAVCNFSM6AAAAABZYVAV5SVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDOMJVGMZTCNBZGI . You are receiving this because you were mentioned.Message ID: @.***>
busy few days at my internship, will get back to you on this as soon as possible, hopefully by the end of the weekend.
Still failing these tests when I run pytest tests/test_meta/test_hierarchical_predictor.py
sklego/meta/_grouped_utils.py:64: ValueError
=========================================================================== short test summary info ===========================================================================
FAILED tests/test_meta/test_hierarchical_predictor.py::test_sklearn_compatible_estimator[HierarchicalRegressor(estimator=LinearRegression(),groups=0)-check_sample_weights_not_an_array] - ValueError: DataFrame constructor not properly called!
FAILED tests/test_meta/test_hierarchical_predictor.py::test_sklearn_compatible_estimator[HierarchicalRegressor(estimator=LinearRegression(),groups=0)-check_sample_weight_equivalence_on_dense_data] - ZeroDivisionError: Weights sum to zero, can't be normalized
FAILED tests/test_meta/test_hierarchical_predictor.py::test_sklearn_compatible_estimator[HierarchicalRegressor(estimator=LinearRegression(),groups=0)-check_sample_weight_equivalence_on_sparse_data] - ValueError: Estimator does not work on sparse matrices
Also, @FBruzzesi , don't the default sklearn sample weight tests (which I have failed above) take care of the tests? What other tests would be required in our case?
Hey @AhmedThahir thanks for pushing the changes
Still failing these tests when I run
pytest tests/test_meta/test_hierarchical_predictor.py
I am trying to debug the tests locally. From what I can see:
-
check_sample_weights_not_an_array(ValueError: DataFrame constructor not properly called!): raises in line 277 because X is not an array type. I think it's ok to raise since some "structure" in the input is required anyway for Hierarchical predictors, however I would raise a more informative error message. -
check_sample_weight_equivalence_on_dense_data(ZeroDivisionError: Weights sum to zero, can't be normalized) - the issue is with the group column being a random continuous value from the automatic sklearn check, hence not really a group-like
One similar issue I can think of is, what should happen if one sample_weight is non-zero, but one group has all zeros in the array subset? I don't have an answer for that, but I think that would be ok to raise.
@koaning considering we are already skipping 9 sklearn compatible estimator check for Hierarchical, I would consider removing that entirely and come up with some ad-hoc tests. What do you think?
Any action required from my side ?
Any action required from my side ?
Hey @AhmedThahir, not really! If you are interested you can think about how we can test most of the hierchical functionalities and edge cases eventually
@koaning considering we are already skipping 9 sklearn compatible estimator check for Hierarchical, I would consider removing that entirely and come up with some ad-hoc tests. What do you think?
Yeah, agree. Optimise for ease of maintainance here :)
Any action required from my side ?
Hey @AhmedThahir, not really! If you are interested you can think about how we can test most of the hierchical functionalities and edge cases eventually
I'm not sure if I'm quite sure how to proceed here. the inbuilt check sample weights seems to take care of all cases that I can think of.
I'm not sure if I'm quite sure how to proceed here. the inbuilt check sample weights seems to take care of all cases that I can think of.
Hey @AhmedThahir, what I meant is bigger scoped actually. For now you might:
- Add the failing tests to the list of those to skip (
check_sample_weights_not_an_array,check_sample_weight_equivalence_on_dense_data,check_sample_weight_equivalence_on_sparse_data) - Introduce new dedicated tests to check sample weight behave as expected in the dataframe/array cases