
Non-stratified splitting and overwriting of loss function in classification tasks

Open kaueltzen opened this issue 1 year ago • 6 comments

Hello,

I'd like to report two issues regarding classification tasks in modnet:

First, the loss function passed to ModnetModel().fit() is overwritten with "categorical_crossentropy" whenever val_data is not None and self.multi_label is False: https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/models/vanilla.py#L400-L411 Since the loss=None case is already handled earlier, in L352-L360 during preprocessing of the training data, perhaps this overwrite could simply be dropped from the validation-data preprocessing?
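For concreteness, here is a minimal, hypothetical sketch of the guard I have in mind (a standalone helper rather than modnet's actual inline code; the default loss names are assumptions):

```python
def resolve_loss(loss, is_classification: bool, multi_label: bool = False):
    """Hypothetical helper: keep a user-supplied loss and only fall back to
    a default when loss is None, instead of overwriting it unconditionally."""
    if loss is not None:
        return loss  # never clobber an explicitly passed loss function
    if is_classification:
        return "binary_crossentropy" if multi_label else "categorical_crossentropy"
    return "mse"
```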

Second, if nested=False, both FitGenetic and ModnetModel.fit_preset() perform a train/test split that is not stratified: https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/models/vanilla.py#L580 https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/hyper_opt/fit_genetic.py#L458-L462 This is an issue for imbalanced datasets, and it would be helpful if the split were stratified for classification tasks.
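To illustrate why this matters for imbalanced data, a small standalone example using scikit-learn's train_test_split (toy data, not modnet code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: roughly a 90/10 class split.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# Without stratify=y, a small validation set can end up with almost no
# minority-class samples; stratifying preserves the class ratio in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)
print(np.bincount(y_train) / len(y_train), np.bincount(y_val) / len(y_val))
```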

If you are interested, I'm happy to raise a PR with fixes.

kaueltzen avatar Oct 20 '24 10:10 kaueltzen

Btw, I just noticed that the validation split in Keras is always taken from the last samples provided to Model.fit(): https://www.tensorflow.org/versions/r2.11/api_docs/python/tf/keras/Model This could be an issue when the training_data passed to ModnetModel.fit() is sorted by label (regardless of whether it is a classification or regression task). So, if val_data=None, shuffling the training data before calling https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/models/vanilla.py#L475 would make sense in the regression case, while for classification a stratified split based on val_fraction should perhaps already happen inside ModnetModel.fit(). Thoughts?
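A toy example of the failure mode, independent of modnet (plain NumPy):

```python
import numpy as np

# Training data sorted by label: Keras' validation_split=0.1 would take the
# last 10 samples, i.e. only the label-1 class, as the validation set.
x = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Shuffling first makes the tail (and hence the Keras validation split)
# a mix of both labels.
rng = np.random.default_rng(0)
perm = rng.permutation(len(y))
x, y = x[perm], y[perm]
print(y[-10:])
```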

kaueltzen avatar Oct 20 '24 11:10 kaueltzen

You are correct on all three points. For the second, I agree that adding stratification by default makes sense, as it follows what we do for k-fold (i.e. StratifiedKFold). The last one indeed only applies when a float split is used as input. I would either warn in the docs that the split is taken from the last part, or, better, do as you suggest and closely mimic what is done for the k-folds and hold-out.

You could perhaps combine points 2 and 3 by defining a hold-out split function that takes a fraction as input and handles shuffling and stratification, similar to the k-fold split here: https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/matbench/benchmark.py#L16 Happy to have a PR on this, thanks!
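Something along these lines, perhaps (a hypothetical sketch mirroring the k-fold helper; the name holdout_split and its signature are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def holdout_split(y, val_fraction: float, classification: bool, seed: int = 42):
    """Hypothetical hold-out analogue of the k-fold splitter: returns
    (train_idx, val_idx), shuffled, stratified on y for classification."""
    indices = np.arange(len(y))
    return train_test_split(
        indices,
        test_size=val_fraction,
        random_state=seed,
        shuffle=True,
        stratify=y if classification else None,
    )
```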

ppdebreuck avatar Oct 21 '24 15:10 ppdebreuck

Thanks for your answer! One more point regarding the overwriting of loss functions that would also be good to have in the PR: in evaluate of ModnetModel, for classification it is not the passed loss function but always the negated ROC-AUC (-roc_auc) that is returned: https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/models/vanilla.py#L768-L784 This method is also used when evaluating individuals in FitGenetic, https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/hyper_opt/fit_genetic.py#L237-L240 so fitting and validation may be done with different metrics.

If you agree, I would also like to implement passing a loss function to evaluate for classification tasks, roughly along the lines sketched below.
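A hypothetical sketch of what I mean (modnet's evaluate computes this inline; the helper name here is made up, and the binary case is used for brevity):

```python
from sklearn.metrics import log_loss, roc_auc_score

def classification_score(y_true, y_prob, loss=None):
    """Hypothetical sketch: score with the metric that was used as the
    training loss, falling back to the current behaviour by default."""
    if loss == "categorical_crossentropy":
        return log_loss(y_true, y_prob)
    # Current modnet behaviour: lower is better, so ROC-AUC is negated.
    return -roc_auc_score(y_true, y_prob)
```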

kaueltzen avatar Oct 25 '24 11:10 kaueltzen

Hi @ppdebreuck, one more thing: while looking into stratifying the bootstrapping in fit of EnsembleMODNetModel, I noticed that the same bootstrapped sample is drawn from the training data self.n_models times, because the random_state is always the same: https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/models/ensemble.py#L90-L104 Is this behaviour intended? If not, I'd replace it with something that is reproducible but creates self.n_models different samples, e.g. as sketched below.
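For example, deriving a distinct but deterministic random_state per ensemble member (a standalone sketch using sklearn's resample; the base seed is an arbitrary assumption):

```python
from sklearn.utils import resample

n_models = 5
data = list(range(10))

# One deterministic seed per ensemble member: runs stay reproducible,
# but each member now sees a different bootstrapped sample.
bootstraps = [
    resample(data, replace=True, random_state=1000 + i)
    for i in range(n_models)
]
```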

kaueltzen avatar Oct 26 '24 10:10 kaueltzen


Thanks for noticing this -- it looks like this random state has long overstayed its welcome and has perhaps been reducing model performance for a few years (!). I'll raise a separate issue and open a PR to fix this.

ml-evs avatar Oct 26 '24 13:10 ml-evs

@ml-evs It's indeed not the intended behaviour, but note that the GA never actually uses bootstrapping; it only comes into play when fitting an EnsembleMODNetModel from scratch, so we never ran into this issue.

I would turn off bootstrapping altogether by default; just having different initial weights is usually enough, or even better.

ppdebreuck avatar Oct 31 '24 16:10 ppdebreuck