hep_ml
Negative sWeights
Hi,
I am trying to use the BoostingToUniformity notebook, in particular the uBoost classifier, and I am getting the error message 'the weights should be non-negative'. I tried removing this check from the source code and running uBoost without it, but then the 'predict' function returns an array of all zeros, and when I try to plot the ROC curve I get NaNs as the output. Is there a way of dealing with negative weights?
Many thanks,
Martha
Hi Martha, negative weights aren't friendly towards ML, because they lead to a non-convex, unbounded optimization problem, so you should not expect them to work correctly with ML models (although sometimes they do).
@tlikhomanenko prepared an overview of strategies for dealing with negative weights some time ago, but the first thing you should try is simply removing the samples with negative weights from training (but not from testing, that's important).
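Something like this minimal sketch of the idea (clf, X_train, y_train, weights_train and X_test are placeholder names for your classifier and data, not part of hep_ml):

import numpy
# keep only the events with positive weight for fitting
weights_train = numpy.asarray(weights_train)
positive = weights_train > 0
clf.fit(X_train[positive], y_train[positive], sample_weight=weights_train[positive])
# evaluate on the full test sample, negative weights included
test_scores = clf.predict_proba(X_test)[:, 1]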
Hi Martha,
Please have a look at this notebook https://github.com/yandexdataschool/mlhep2015/blob/master/day2/advanced_seminars/sPlot.ipynb prepared for a summer school. There is a part called "Training on sPlot data" where you can find several approaches for training your classifier on data with negative and positive weights. I hope you'll find them useful.
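If I remember the notebook correctly, one of the approaches there ('Add events two times in training') puts every event into training twice, once per class, weighted by the corresponding sWeight; please check the notebook for the exact recipe. A rough sketch (X, w_sig, w_bck and clf are placeholder names):

import numpy
# each event appears once as signal (label 1) with its signal sWeight
# and once as background (label 0) with its background sWeight
X_doubled = numpy.concatenate([X, X])
y_doubled = numpy.concatenate([numpy.ones(len(X)), numpy.zeros(len(X))])
w_doubled = numpy.concatenate([w_sig, w_bck])
clf.fit(X_doubled, y_doubled, sample_weight=w_doubled)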
For classifiers that only compute statistics on ensembles of events whilst fitting, like decision trees, I would hope that an implementation would accept negative weights, rather than doing assert (weights < 0).sum() == 0. Where it should fail is when the sum of weights in an ensemble currently under study is negative.
Thanks for your responses. I have tried removing the negative weights from my training sample and classifier.predict(X_train) is giving me an array of all 1's. Do you know why this is happening?
I am using a method similar to the one in the 'Add events two times in training' section of the notebook above.
@alexpearce Hey Alex, I don't think it is so different for trees. Things may go arbitrarily bad in very simple situations:

import numpy
from sklearn.ensemble import GradientBoostingRegressor

reg = GradientBoostingRegressor(n_estimators=100, max_depth=1).fit(numpy.arange(2)[:, None], numpy.arange(2), sample_weight=[-0.9999999999, 1])
reg.predict(numpy.arange(2)[:, None])
# outputs: array([9.99999917e+09, 9.99999917e+09])
@marthaisabelhilton No idea, but try to use clf.predict_proba to see if it provides meaningful separation.
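For example (a quick sketch; roc_auc_score is from scikit-learn, and clf, X_test, y_test, w_test are placeholder names):

from sklearn.metrics import roc_auc_score
# use the continuous scores rather than the hard 0/1 output of predict
test_proba = clf.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, test_proba, sample_weight=w_test))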
Yes, negative weights certainly can make things go bad, but with very small sample sizes sWeights don't make much sense either; they only give 'reasonable' results with 'large' ensembles (all poorly defined terms, of course). That's why I was suggesting that algorithms shouldn't check immediately for negative weights, but only when actually computing the quantities used in the fitting.
@alexpearce Well, in that case you should check the sum in each particular leaf of the tree (since we aggregate over the samples in a leaf).
I can see complaints coming, like "it just worked with two trees, what's the problem with the third one?" (in a huge ensemble like uBoost this check will almost surely be triggered), but I don't mind if anyone decides to PR such checks.
Well, in that case you should check the sum in each particular leaf of the tree (since we aggregate over the samples in a leaf).
Yes, exactly. The check should be made at that point, rather than when the training data is first fed into the tree.
And you're right, I should just open a PR if I think this is useful behaviour. I'll look into it.
(You're also right, for the third time, that I might be underestimating how often an ensemble of negative weights will have a negative sum, but I would leave that problem to the users tuning the hyperparameters.)
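For concreteness, the sort of check I have in mind would look roughly like this (a sketch with made-up names, not hep_ml code):

import numpy

def node_weight_or_fail(sample_weight, indices_in_node):
    # aggregate the weights of the events that reached this node;
    # complain only if the aggregated weight is non-positive, rather than
    # rejecting any individual negative weight up front
    total = numpy.sum(numpy.asarray(sample_weight)[indices_in_node])
    if total <= 0:
        raise ValueError('sum of sample weights in this node is non-positive: %r' % total)
    return total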