
How to give sample weights?

Open kgullikson88 opened this issue 7 years ago • 24 comments

Is there a way to give sample weights to the fit method? I see that the metrics can take it as an argument, but the fit method doesn't.

kgullikson88 avatar May 24 '17 21:05 kgullikson88

Could you please describe your use case? If the data is unbalanced, auto-sklearn decides during configuration whether to activate the class_weight feature.

mfeurer avatar May 25 '17 15:05 mfeurer

I have some samples that I really need to get right, so I want to assign a larger penalty for getting those wrong. Many of the sklearn classifiers take a sample_weight keyword argument in the fit method for this purpose (logistic regression does, for example).
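For reference, a minimal sketch (with made-up toy data) of the scikit-learn API I mean:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 6 samples, 2 features (made up for illustration).
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5],
              [2.0, 2.0], [3.0, 2.5], [2.5, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Up-weight the samples we really need to get right.
weights = np.array([1.0, 1.0, 10.0, 1.0, 1.0, 10.0])

clf = LogisticRegression()
# Misclassifying a heavily weighted sample incurs a larger penalty in the loss.
clf.fit(X, y, sample_weight=weights)
```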

kgullikson88 avatar May 25 '17 16:05 kgullikson88

I see your point. Sample weights would then have to be passed to each method in auto-sklearn, especially the final ensemble building procedure, right? However, I think this would contradict auto-sklearn's principle of optimizing a user-given metric. Do you think you could create a custom metric which penalizes a solution for missing your important data points?

mfeurer avatar May 26 '17 08:05 mfeurer

That is sort of what I'm asking how to do. I tried making a custom metric (f3-score) that takes sample weights:

from sklearn.metrics import precision_score, recall_score
from autosklearn.metrics import make_scorer

def score_func(y_true, y_pred, beta=3, sample_weight=None):
    # F-beta score; beta=3 weights recall more heavily than precision.
    # Both sklearn metrics accept sample_weight=None, so it can be passed through directly.
    prec = precision_score(y_true=y_true, y_pred=y_pred, sample_weight=sample_weight)
    rec = recall_score(y_true=y_true, y_pred=y_pred, sample_weight=sample_weight)
    if prec == 0 and rec == 0:
        return 0.0
    return (1 + beta**2) * prec * rec / (beta**2 * prec + rec)

scorer = make_scorer('f3_score', score_func, sample_weight=weights)

However, when I fit with that, I get errors about incompatible sizes because the metric doesn't know which samples are in the holdout set. The sample weights have shape (full_sample_size,), while y_true and y_pred are smaller due to cross-validation.
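A minimal sketch of what goes wrong (toy data, plain scikit-learn standing in for auto-sklearn's holdout resampling): the scorer only ever sees the holdout fold, so a full-length weight vector no longer lines up with it:

```python
import numpy as np
from sklearn.model_selection import train_test_split

full_sample_size = 100
rng = np.random.RandomState(0)
X = rng.rand(full_sample_size, 3)
y = rng.randint(0, 2, size=full_sample_size)
weights = np.ones(full_sample_size)  # shape (full_sample_size,)

# The resampling strategy evaluates the metric on a holdout subset:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# y_test (what the scorer sees as y_true) is shorter than the weight
# vector, so sample_weight cannot be applied element-wise.
print(len(y_test), len(weights))  # 25 vs 100
```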

kgullikson88 avatar May 26 '17 14:05 kgullikson88

Thanks for pointing that out. It would be great to have auto-sklearn accept sample weights for the scoring functions. However, I will not be able to implement this feature in the next few weeks. If you want to contribute it, I would be happy to assist.

mfeurer avatar May 29 '17 19:05 mfeurer

Sure, I could give it a shot if you could point me in the direction of where I would need to make the changes.

kgullikson88 avatar May 30 '17 13:05 kgullikson88

Great. Here's a brief tour:

auto-sklearn stores the data in the class AbstractDataManager, from which XYDataManager is derived. The data managers are used to persist the data on disk and are loaded by the evaluation module, which takes care of restricting the runtime and memory usage of the target algorithm. Weights of the data points would have to be persisted, too.

The data manager is then used in the evaluator class, where the optimization loss is calculated. I think these are the code pieces that need to change in order to influence the optimization procedure.

Furthermore, you would need to change the call to the scoring function in the ensemble builder and ensemble selection. Those two will be a bit trickier, as they rely on the correct sorting of the data (the sorting changes with the resampling strategy). You can have a look here at how the targets are built in order to accommodate the change of order.

I hope this is not too complicated and gives a good overview of where the code needs to be changed. In general, a search for calls to calculate_score would be a good idea in case I missed one.

One more note: I will probably have no time to reply tomorrow and will be out of office for a few days afterwards. Therefore, I might not reply immediately until next Wednesday.

mfeurer avatar May 30 '17 19:05 mfeurer

Hey, has the sample_weight problem been solved by now? Thanks

xiangning-chen avatar Sep 22 '18 22:09 xiangning-chen

It would be great if we could add a sample_weight representing the confidence of each data point.

forest-jiang avatar Nov 21 '19 04:11 forest-jiang

@mfeurer Hi, regarding passing the right sample weights, I'm thinking we can leverage the index from a pandas.DataFrame or pandas.Series. Currently, this can be done in sklearn's GridSearchCV (see https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function?answertab=active#tab-top).

I do know that auto-sklearn converts X and y into numpy arrays by applying sklearn.utils.check_array, even when pandas DataFrames are passed to the fit method. Is there a specific reason you enforce this?
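A rough sketch of that pattern with toy data (nothing auto-sklearn-specific; the weights Series here is hypothetical): because GridSearchCV slices pandas objects while preserving their index, a custom scorer can look up the matching weights for each CV fold:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Made-up data; the point is that X, y, and weights share a pandas index.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(40, 3))
y = pd.Series(rng.randint(0, 2, size=40))
weights = pd.Series(rng.rand(40), index=y.index)

def weighted_accuracy(y_true, y_pred):
    # y_true is a pandas slice of the CV fold, so its index tells us
    # which rows of the full-length weight vector to use.
    w = weights.loc[y_true.index]
    return accuracy_score(y_true, y_pred, sample_weight=w)

grid = GridSearchCV(LogisticRegression(),
                    param_grid={"C": [0.1, 1.0]},
                    scoring=make_scorer(weighted_accuracy),
                    cv=3)
grid.fit(X, y)
```

This only works as long as the pandas index survives the internal splitting, which is exactly what auto-sklearn's conversion to numpy arrays destroys.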

kaiweiang avatar May 04 '20 03:05 kaiweiang

We convert pandas dataframes to numpy arrays because we never found the time to update our packages to accept pandas. I'd be happy about pull requests overcoming this issue.

mfeurer avatar May 18 '20 07:05 mfeurer

Just throwing in another use case: I'm doing multioutput regression and there are some missing values in y.

krey avatar Aug 21 '21 13:08 krey

Any chance someone has created the PR for the sample_weight addition while searching for the best configuration? Cheers.

simonprovost avatar Jan 16 '22 14:01 simonprovost

@mfeurer One of your commits can be seen here: https://github.com/automl/auto-sklearn/commit/6de26d7ed1eba30360b5e4a7f72d2dbeab47a072, where you introduce "Feature: weighting for imbalanced classes". Is it possible to use sample weights with this feature, then? I am perplexed, especially given how old that commit is compared to the more recent discussion in this thread.

Cheers.

simonprovost avatar Jan 16 '22 15:01 simonprovost

Hey @simonprovost, no, unfortunately, there has not yet been any progress on this. We'd be happy about a contribution, otherwise, we'll discuss in our next offline meeting whether we can increase the priority on this one.

mfeurer avatar Jan 17 '22 08:01 mfeurer

@mfeurer Great, thanks for the prompt answer. I could take a look at it; is the contribution walkthrough you gave at the beginning of this thread still accurate for the new version of auto-sklearn?

Cheers

simonprovost avatar Jan 17 '22 08:01 simonprovost

Mostly. Off the top of my head, these are the modules to be changed:

  • autosklearn.data.abstract_data_manager
  • autosklearn.data.xy_data_manager
  • autosklearn.data.feature_validator
  • autosklearn.evaluation (probably all files in there)
  • autosklearn.ensembles.ensemble_selection
  • autosklearn.ensemble_builder
  • pipeline.components.data_preprocessing.balancing

@eddiebergman can you think of any other modules that need to be updated for this to be supported?

mfeurer avatar Jan 17 '22 14:01 mfeurer

Not off the top of my head. The main difficulty is that sample weights need to be passed through the entire chain of objects, which is not entirely transparent; hence the need to update quite a few modules.

I would be happy to regularly review a PR and give guidance during it if you would like to contribute these changes :)

Best, Eddie

eddiebergman avatar Jan 18 '22 09:01 eddiebergman

Yeah, apologies for saying I could do this and then disappearing. I did start taking a look, but got pretty lost in all the code that would need to be changed and then got pulled into other projects.

kgullikson88 avatar Jan 18 '22 15:01 kgullikson88

This would be a very good feature to implement.

dmenig avatar Jun 12 '22 22:06 dmenig

Any updates on that? It seems like a useful thing to do. Actually, I think it should be relatively easy, in the sense that sample_weight should just be propagated to all the fit() methods of a given pipeline... shouldn't it?

So no need for a custom metric; just propagate the importances down to every fit call.
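In plain scikit-learn that propagation can be sketched like this (toy data; step names made up). The catch is that it only works for steps whose fit() actually accepts sample_weight:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.rand(50, 4)
y = rng.randint(0, 2, size=50)
weights = rng.rand(50)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", SGDClassifier(random_state=0))])

# Fit params are routed to individual steps via "<step>__<param>";
# here only the final classifier receives the per-sample weights.
pipe.fit(X, y, clf__sample_weight=weights)
```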

mrektor avatar Nov 09 '22 14:11 mrektor

@mrektor Exactly. As I have just begun my Ph.D. in AutoML, I unfortunately do not have the time to contribute this (in theory short) PR. Otherwise, as the authors indicated, feel free to give it a whirl; they would be delighted to review such a PR.

Cheers.

simonprovost avatar Nov 09 '22 14:11 simonprovost

@mrektor Sorry for ghosting, I'm half-back on maintaining auto-sklearn and my first priorities are to update scikit-learn, SMAC, pynisher and ConfigSpace. After that I will add it on to the stack.

In theory yes, it's quite simple; in practice it's complicated by obscurities in multi-processing and the fact that sample weights are not supported by all components.

eddiebergman avatar Nov 15 '22 16:11 eddiebergman

I see, nice! So are you planning to integrate with scikit-learn 1.x? It was quite a pain having to downgrade, as many packages now depend on 1.x... Good to know! Keep up the good work.

mrektor avatar Nov 15 '22 16:11 mrektor