
Add sample_weight fit param for Pipeline

Open rth opened this issue 5 years ago • 26 comments

Currently Pipeline doesn't support the sample_weight fit param. It fails with,

ValueError: Pipeline.fit does not accept the sample_weights parameter. You can pass parameters to
specific steps of your pipeline using the stepname__parameter format, e.g.
 `Pipeline.fit(X, y, logisticregression__sample_weight=sample_weight)`.

Passing the sample weight as logisticregression__sample_weight is annoying because it is too strongly coupled with the pipeline step names. This means the code would break if a different estimator is used in the pipeline, or else one needs to choose step names independent of the estimator used. In any case, pipelines constructed with make_pipeline would break.
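
For context, a minimal reproduction of the failure, assuming a toy dataset (the variable names are illustrative); this raises the error quoted above in releases without sample_weight routing:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
sample_weight = np.ones(len(y))

pipe = make_pipeline(StandardScaler(), LogisticRegression())
# Pipeline.fit does not accept sample_weight directly and raises the ValueError quoted above.
pipe.fit(X, y, sample_weight=sample_weight)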

I haven't followed the sample props discussion in detail https://github.com/scikit-learn/scikit-learn/issues/4497 but is there anything preventing us from adding a sample_weight parameter to the pipeline and doing one of the following?

  1. either pass it to the last estimator.
  2. or pass it to all estimators in the pipeline that support sample weights (not sure if there are many transformers with sample weights). Raise a ~~warning~~ error if none support it.

The case 1. would already likely address most use cases. Are there drawbacks in doing this?

rth avatar Aug 14 '20 14:08 rth

To clarify, given an arbitrary classification or regression pipeline, currently to fit it with sample weights one has to do something like,

kwargs = {pipe.steps[-1][0] + '__sample_weight': sample_weight}
pipe.fit(X, y, **kwargs)

This is not very user friendly or readable.

rth avatar Aug 14 '20 14:08 rth

Agreed that the status quo sucks. Passing to the last estimator is okay, but maybe surprising to users that it is not passed to other estimators.

The proposal in SLEP006 and #16079 might have the user call .request_sample_weight() on each estimator that should consume it. Then the Pipeline would pass that parameter to where it is requested.
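
(For reference, a rough sketch of what such a per-estimator request looks like with the metadata-routing API in much later scikit-learn releases, which uses set_fit_request rather than the request_sample_weight spelling proposed above; it assumes a version where routing is supported for Pipeline and reuses X, y, sample_weight from the snippet above.)

import sklearn
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Metadata routing has to be enabled explicitly.
sklearn.set_config(enable_metadata_routing=True)

pipe = make_pipeline(
    StandardScaler(),
    # Only the final estimator explicitly requests sample_weight for fit.
    LogisticRegression().set_fit_request(sample_weight=True),
)
# sample_weight is then routed only to the step that requested it.
pipe.fit(X, y, sample_weight=sample_weight)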

jnothman avatar Aug 15 '20 11:08 jnothman

The only example I could find that needs sample weights passed to an intermediate step is https://github.com/scikit-learn/scikit-learn/blob/master/examples/text/plot_document_classification_20newsgroups.py#L294-L297.

madhuracj avatar Aug 15 '20 11:08 madhuracj

The case 1. would already likely address most use cases. Are there drawbacks in doing this?

I would say that for a user expecting this behaviour, it will be fine. For the one expecting sample_weight to be dispatched everywhere, it will make them angry. At least now, we just annoy everyone.

I would think that there are cases where it could be wrong: you use null weights to discard samples, and if you have a decomposition in your pipeline then your space will be different depending on whether sample_weight is taken into account or not. I am just afraid of making it work without the user knowing what happens (and writing it in the doc does not count, who reads the doc :))

glemaitre avatar Aug 17 '20 09:08 glemaitre

For the one expecting sample_weight to be dispatched everywhere, it will make them angry. At least now, we just annoy everyone.

Right, I understand SLEP006 is much more general, but so far I don't understand how it solves this immediate problem. Say there is a black-box estimator and you want to pass sample weights to it. I would expect to be able to do,

estimator.fit(X, y, sample_weight=...)

If that estimator happens to be a pipeline, then one of the two options above (with a personal preference for 2.) should still happen without additional information. SLEP006 can make it nicer to specify which estimator gets sample_weight (currently one can specify '*__sample_weight' to some extent), but I would still want a reasonable default behavior that addresses most use cases when using just the sample_weight fit param and building the pipeline with make_pipeline. Having the props fit param as the only option (or having to call set_props_request) breaks the API expectations IMO (it's fine to have it as one of the possible options). For instance, we can't deprecate normalize in Ridge (https://github.com/scikit-learn/scikit-learn/pull/17772) saying that using pipe = make_pipeline(StandardScaler(), Ridge()) is the same thing until we have sample_weight support in the Pipeline.

have the user call .request_sample_weight() on each estimator that should consume it.

Assuming one chooses option 2, can't one already do a lot of it by checking whether sample_weight is in the fit signature and, if so, passing it? I guess this corresponds to solution 1 in https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep006/proposal.html#solution-1-pass-everything but raising an error if none of the steps accept 'sample_weight'. The question is why can't this already be done for the pipeline (i.e. handle the pipeline as any other meta-estimator with sample weights) without waiting for the general implementation of SLEP006? What would be the downside of doing this?
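
A minimal sketch of that idea (option 2), assuming has_fit_parameter from sklearn.utils.validation; the PipelineAllSW name is made up for illustration:

from sklearn.pipeline import Pipeline
from sklearn.utils.validation import has_fit_parameter

class PipelineAllSW(Pipeline):
    def fit(self, X, y=None, sample_weight=None):
        """Pass sample_weight to every step whose fit signature accepts it."""
        kwargs = {}
        if sample_weight is not None:
            # Inspect each step's fit signature; skip 'passthrough' placeholders.
            supported = [name for name, est in self.steps
                         if hasattr(est, 'fit') and has_fit_parameter(est, 'sample_weight')]
            if not supported:
                raise ValueError("No step in this pipeline supports sample_weight.")
            kwargs = {name + '__sample_weight': sample_weight for name in supported}
        return super().fit(X, y, **kwargs)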

rth avatar Aug 17 '20 12:08 rth

What I would personally expect is neither 1. nor 2., but rather 3.: that the pipeline passes sample_weight to all steps and raises an error if any of the estimators doesn't support it (i.e. require all steps to have SW support).

It seems that the lack of clear good default behavior is the reason we're forcing users to explicitly address each individual step. This avoids unexpected results.

A lot of users have different expectations, and most definitely don't read the docs. I remember a Twitter shitstorm about people lecturing us because logistic regression is regularized by default...

NicolasHug avatar Aug 17 '20 13:08 NicolasHug

What I would personally expect is neither 1. nor 2., but rather 3.: that the pipeline passes sample_weight to all steps and raises an error if any of the estimators doesn't support it.

+1 sorry, I meant that it should have been option 2. I wrote raises a warning if no estimator supports it there, but I meant raises an error.

It seems that the lack of clear good default behavior is the reason we're forcing users to explicitly address each individual step. This avoids unexpected results.

I guess I'm questioning this statement that there is no good default behavior (e.g. use option 2 ~~/3~~). For instance, even if say StandardScaler supports sample_weight one day, whether or not we pass sample weights to it before the classifier, either way the result will not be incorrect. It's a (debatable) choice that we can document, that will be good for 95% of use cases, and for the remaining ones users can pass estimator__sample_weight explicitly or use sample props for more control.

Stretching this argument of "no good default", one could say that there is no good default for hyper-parameters and therefore they should always be passed explicitly. We don't do that because usability matters. Defaults are not about always providing the ideal solution, but about a reasonable value that works for the most frequent use cases and that might provide sub-optimal but not incorrect results otherwise.

Or at least I'm looking for concrete counter-examples to this argument among frequent use cases of pipelines.

rth avatar Aug 17 '20 14:08 rth

Stretching this argument of "no good default" ...

fair point

Just to clarify, I think 3) is still different from the newly updated 2). In 3), all steps are required to support SW, while 2) is happy with just one of them supporting SW. I updated my comment to make it clearer.

NicolasHug avatar Aug 17 '20 14:08 NicolasHug

It's a (debatable) choice

Or it might not be, going back to the definition of sample weights: setting some of them to zero should be equivalent to removing those samples (https://github.com/scikit-learn/scikit-learn/issues/15657) or, more practically, passing the check_sample_weights_invariance(..., kind="zeros") check.

For instance Ridge passes that check, but a pipeline of StandardScaler + Ridge fails when sample_weight is only passed to the final estimator,

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

from sklearn.utils.estimator_checks import check_sample_weights_invariance

class PipelineSW(Pipeline):
    def fit(self, X, y, sample_weight=None):
        """Fit and pass sample weights only to the last step"""
        if sample_weight is not None:
            kwargs = {self.steps[-1][0] + '__sample_weight': sample_weight}
        else:
            kwargs = {}
        return super().fit(X, y, **kwargs)
           
pipe = PipelineSW([('scaler', StandardScaler()), ('ridge', Ridge())])
check_sample_weights_invariance('ridge-pipeline', pipe, kind='zeros')

produces

AssertionError                            Traceback (most recent call last)
[...]
sklearn/utils/estimator_checks.py in 
--> 945             assert_allclose_dense_sparse(X_pred1, X_pred2, err_msg=err_msg)
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=1e-09
For ridge-pipeline, a zero sample_weight is not equivalent to removing the sample
Mismatched elements: 16 / 16 (100%)
Max absolute difference: 0.00946216
Max relative difference: 0.00899501

so approach 1) is likely not correct. In this example, the correct approach would indeed be option 3 (i.e. pass sample weights everywhere), but I'm not sure if that's a general property. However, until all estimators support sample weights (e.g. https://github.com/scikit-learn/scikit-learn/issues/15601 for StandardScaler), I feel the most pragmatic/useful solution would still be option 2 (i.e. pass to all estimators that support them at fit). This is also what the TPOT library does for ML pipelines, BTW.

rth avatar Aug 17 '20 16:08 rth

Difficulties with option 2:

  • We will not be able to ensure backwards compatibility when an estimator is extended to support sample_weight. Adding sample_weight support to StandardScaler would silently change behaviour across versions.
  • You can't determine whether an estimator supports sample_weight programmatically, except by trial and error. I suppose that's okay.

Difficulties with option 1:

  • Why treat sample_weight specially? I assume we should pass all fit params from Pipeline that do not contain __ to the last step... Maybe that's not an issue?
  • The current proposal for SLEP006 requires the consumer to request the param, rather than default behaviour depending on where it is positioned in the pipeline. So implementing this behaviour would mean deprecations or breaking changes were we to accept and implement SLEP006 as currently proposed.

I don't have any difficulties with option 3 except that it doesn't solve your issue.

jnothman avatar Aug 18 '20 00:08 jnothman

a good first step for me is: if sample_weight is passed to fit in a pipeline, then all steps need to support it.

then we can allow estimator__sample_weight in a next PR to specify which step you want to pass it to.

to me this ensures backwards compatibility. It cannot silently behave differently.

wdyt?

agramfort avatar Sep 01 '20 13:09 agramfort

if sample_weight is passed to fit in a pipeline, then all steps need to support it.

to me this ensures backwards compatibility. It cannot silently behave differently.

I agree. If we want a sample_weight param without explicitly providing the routing information, I also don't see a way around this. I would add: all steps need to support it, or explicitly indicate, e.g. via estimator tags, that each sample is processed independently (for instance Normalizer) and therefore sample_weight does not apply.

In practice this means that we would need to add sample_weight support to preprocessors and ColumnTransformer to make this useful.

rth avatar Sep 02 '20 08:09 rth

In practice this means that we would need to add sample_weight support to preprocessors and ColumnTransformer to make this useful.

sure, but it can be done in the future. We already have a real use case to justify investing in a first PR.

if we achieve this we suddenly solve long-standing API issues in linear models

agramfort avatar Sep 02 '20 08:09 agramfort

Just a note that the "if all steps support sample_weight" check is not trivial, since a meta-estimator may accept **kwargs and we don't know whether sample_weight is really supported there or not.

adrinjalali avatar Sep 02 '20 08:09 adrinjalali

Just a note that the "if all steps support sample_weight" check is not trivial, since a meta-estimator may accept **kwargs and we don't know whether sample_weight is really supported there or not.

Yes, we would need some way to determine if an estimator supports sample weights irrespective of the fit **kwargs. For instance, an estimator tag fit_sample_weight with one of: True (supports), False (doesn't) and None (not applicable, operates on samples independently). This could probably be mostly auto-determined, and might just need to be fixed manually in a few cases.

As long as we don't remove sample_weight support from an estimator, it would be backward compatible: a pipeline that supported sample weights at one version will also support them in following versions.

rth avatar Sep 02 '20 11:09 rth

For instance an estimator tag fit_sample_weight

For ref, I tried that in https://github.com/scikit-learn/scikit-learn/pull/13565. It wasn't that easy, but some of the issues I encountered about inheritance should be solved now. This one is however still relevant:

if tags are to be used in the code, we need a deprecation cycle / warnings to indicate to developers that they need to set the tags on their own estimators. For example, changing if has_fit_parameter('sample_weight') to if self._get_tags()['supports_sample_weight'] breaks people's code as of now.

NicolasHug avatar Sep 02 '20 12:09 NicolasHug

Interesting.

if tags are to be used in the code, we need a deprecation cycle / warnings to indicate to developers that they need to set the tags on their own estimators. For example, changing if has_fit_parameter('sample_weight') to if self._get_tags()['supports_sample_weight'] breaks people's code as of now.

One solution could be to set it to has_fit_parameter('sample_weight') by default in BaseEstimator; that way it would also work for third-party estimators. We could even use a helper function that falls back to has_fit_parameter('sample_weight') if that estimator tag is not defined.
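
A minimal sketch of such a helper, assuming a hypothetical supports_sample_weight tag (neither the tag nor this helper exists in scikit-learn):

from sklearn.utils.validation import has_fit_parameter

def estimator_supports_sample_weight(estimator):
    """Return True if the estimator declares, or appears to accept, sample_weight in fit."""
    # Prefer the (hypothetical) 'supports_sample_weight' estimator tag if it is defined...
    tags = getattr(estimator, '_get_tags', dict)()
    if 'supports_sample_weight' in tags:
        return tags['supports_sample_weight']
    # ...otherwise fall back to inspecting the fit signature.
    return has_fit_parameter(estimator, 'sample_weight')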

rth avatar Sep 02 '20 13:09 rth

I like the approach from https://github.com/scikit-learn/scikit-learn/pull/13565/files

+1 to fall back to has_fit_parameter('sample_weight') by default in BaseEstimator if supports_sample_weight is not set. We could also warn that supports_sample_weight is required if sample_weight is available in fit.

I think we should aim for the 80/20 trade-off and accept that some hard cases are just too hard (and most probably uncommon).

agramfort avatar Sep 03 '20 20:09 agramfort

+1 to fall back to has_fit_parameter('sample_weight') by default in BaseEstimator

I recall that @jnothman did not like to use has_fit_parameter for sample_weight: https://github.com/scikit-learn/scikit-learn/pull/9041#discussion_r121811902 @jnothman do you have other arguments that I don't recall?

glemaitre avatar Sep 08 '20 13:09 glemaitre

I think that we really need to solve this issue. We might have some silent bugs in meta-estimators where, in the worst-case scenario, sample_weight is silently ignored (e.g. CalibratedClassifierCV). In the best-case scenario, we raise an error that sample_weight is not supported.

xref: https://github.com/scikit-learn/scikit-learn/issues/21134

glemaitre avatar Sep 24 '21 10:09 glemaitre

let's get https://github.com/scikit-learn/scikit-learn/pull/20350 done then :grin:

adrinjalali avatar Sep 24 '21 10:09 adrinjalali

From the discussion above, I see that passing sample_weight in pipeline.fit() would introduce backward compatibility issues via implicit behavior changes, e.g. when a certain Transformer class later implements support for sample weights.

If this is right, why can't we have an explicit opt-in to pass weights to the last step?

my_pipeline_supporting_weights = make_pipeline(
  transform1, transform2, my_regressor,
  pass_sample_weights_to_last_step=True,
)

The resulting pipeline would support the same interface for passing sample weights as a Regressor or Classifier, with all the benefits associated with it, and no backward compatibility issues, since the behavior is explicitly specified.

NikolayXHD avatar Dec 30 '21 22:12 NikolayXHD

@NikolayXHD You should have a look at the sample-props PR https://github.com/scikit-learn/scikit-learn/pull/20350 that should provide a way to solve the current issue.

glemaitre avatar Jan 04 '22 10:01 glemaitre

Thank you

NikolayXHD avatar Jan 04 '22 13:01 NikolayXHD

Update: https://github.com/scikit-learn/scikit-learn/pull/24027 will solve this issue.

lorentzenchr avatar Aug 06 '22 09:08 lorentzenchr

awesome!

zachmayer avatar Aug 06 '22 12:08 zachmayer

SLEP006 is still not implemented for Pipeline, and it has a prerequisite, which is to handle composite methods (fit_transform and fit_predict). So I'll leave this issue open for now.

adrinjalali avatar Jun 03 '23 15:06 adrinjalali