Non-stratified splitting and overwriting of loss function in classification tasks
Hello,
I'd like to report two issues regarding classification tasks in modnet:
First, the loss function passed to MODNetModel.fit() is overwritten with "categorical_crossentropy" if val_data is not None and self.multi_label is False:
https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/models/vanilla.py#L400-L411
Since the loss=None case is already handled earlier (L352-L360) when preprocessing the training data, maybe this reassignment could simply be removed from the validation-data preprocessing?
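For illustration, a minimal sketch of the behaviour I'd suggest (simplified pseudologic, not the actual vanilla.py code; `resolve_loss` and its arguments are hypothetical names):

```python
def resolve_loss(loss, is_classification, multi_label):
    """Simplified sketch (not the actual vanilla.py logic): pick a default loss only
    when the caller passed none, so that a user-supplied loss is never overwritten
    later during the val_data preprocessing."""
    if loss is None and is_classification and not multi_label:
        return "categorical_crossentropy"   # default, set once during training-data prep
    return loss                             # otherwise keep whatever was passed to fit()
```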
Second, if nested=False, both FitGenetic and MODNetModel.fit_preset() perform a train/test split that is not stratified:
https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/models/vanilla.py#L580
https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/hyper_opt/fit_genetic.py#L458-L462
This is an issue for imbalanced datasets, and it would be helpful if the splitting were stratified for classification tasks.
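For reference, the change in both places would roughly amount to the standard sklearn pattern below (`y` and `val_fraction` stand in for the class labels and the split fraction used there):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stratified hold-out split: class proportions in `y` are preserved in both
# subsets, which is what you want for imbalanced classification datasets.
y = np.array([0] * 90 + [1] * 10)        # toy imbalanced labels
val_fraction = 0.1
train_idx, val_idx = train_test_split(
    np.arange(len(y)),
    test_size=val_fraction,
    stratify=y,                          # keep the 90/10 ratio in both splits
    random_state=42,                     # reproducible split
)
```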
If you are interested, I'm happy to raise a PR with fixes.
By the way, I just noticed that the validation split in Keras is always taken from the last samples provided to Model.fit():
https://www.tensorflow.org/versions/r2.11/api_docs/python/tf/keras/Model
This could be an issue when the training_data passed to MODNetModel.fit() is sorted by label (regardless of whether it is a classification or regression task).
So, if val_data=None, shuffling the training data before calling
https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/models/vanilla.py#L475
would make sense in the regression case, while for classification a stratified split based on val_fraction should perhaps already happen inside MODNetModel.fit(). Thoughts?
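To make the Keras behaviour concrete, here is a small self-contained sketch (toy arrays, not modnet code) showing why unshuffled, label-sorted data breaks validation_split, and the shuffle that fixes it:

```python
import numpy as np

# Toy data sorted by label, mimicking training data ordered by target class.
x = np.arange(20, dtype="float32").reshape(-1, 1)
y = np.repeat([0, 1], 10)                    # first half class 0, second half class 1

# Keras' validation_split always takes the *last* fraction of the rows, so without
# shuffling the validation set here would contain only samples of class 1.
rng = np.random.default_rng(42)
perm = rng.permutation(len(x))
x, y = x[perm], y[perm]
# keras_model.fit(x, y, validation_split=0.2)   # both classes now appear in the split
```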
You are correct on all three points. For the second, I agree that adding stratification by default makes sense, as it follows what we do for k-fold (i.e. StratifiedKFold). For the last one, this indeed only applies when a float split is used as input. I would either add a warning in the docs saying that the split is taken from the last part, or, better, do as you suggest to closely mimic what is done for the k-folds and hold-out.
You could perhaps combine points 2/3 by defining a hold-out split function that takes a fraction as input and handles shuffling or stratification, similar to the k-fold split here: https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/matbench/benchmark.py#L16 Happy to have a PR on this, thanks!
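For concreteness, such a helper could look roughly like this (a minimal sketch; `holdout_split` and its signature are hypothetical, mirroring the StratifiedKFold behaviour mentioned above for classification):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

def holdout_split(targets, val_fraction=0.1, classification=False, random_state=42):
    """Hypothetical hold-out counterpart of the k-fold splitter: returns a single
    (train_idx, val_idx) pair, shuffled, and stratified on the targets when
    classification=True."""
    splitter_cls = StratifiedShuffleSplit if classification else ShuffleSplit
    splitter = splitter_cls(
        n_splits=1, test_size=val_fraction, random_state=random_state
    )
    targets = np.asarray(targets)
    return next(splitter.split(targets, targets if classification else None))

# Example: train_idx, val_idx = holdout_split(y, val_fraction=0.1, classification=True)
```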
Thanks for your answer! One more point regarding the overwriting of loss functions that would also be good to include in the PR:
In evaluate of MODNetModel, in the classification case, the returned score is always -roc_auc rather than the loss function that was passed:
https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/models/vanilla.py#L768-L784
In the evaluation of individuals in FitGenetic, this method is also used,
https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/hyper_opt/fit_genetic.py#L237-L240
so fitting and validation may be done with different metrics.
If you agree, I would also like to implement passing a loss function to evaluate for classification tasks.
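As a rough illustration of what I have in mind (a hypothetical helper, binary case shown for simplicity; log_loss stands in for categorical_crossentropy):

```python
from sklearn.metrics import log_loss, roc_auc_score

def classification_score(y_true, y_prob, loss=None):
    """Hypothetical sketch of the proposed evaluate() behaviour: score with the
    user-supplied loss when one is given, falling back to -ROC-AUC (the current
    behaviour) only as the default."""
    if loss is None:
        return -roc_auc_score(y_true, y_prob[:, 1])    # current default, binary case
    if loss == "categorical_crossentropy":
        return log_loss(y_true, y_prob)                # score with the training loss
    return loss(y_true, y_prob)                        # assume any other loss is a callable
```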
Hi @ppdebreuck, one more thing: while looking into stratifying the bootstrapping in the fit of EnsembleMODNetModel, I noticed that the same bootstrapped sample is drawn from the training data len(self.n_models) times, because the random_state is always the same:
https://github.com/ppdebreuck/modnet/blob/e14188d3b8a036bba0a1d9c0a3f538dc58c3cd29/modnet/models/ensemble.py#L90-L104
Is this behaviour intended? If not, I'd replace it with something that is reproducible but creates len(self.n_models) different samples.
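Something along these lines would keep reproducibility while giving each sub-model its own bootstrap sample (a minimal sketch, assuming a seed offset per model; the names below are not existing ensemble.py attributes):

```python
import numpy as np
from sklearn.utils import resample

# Minimal sketch: derive one seed per sub-model from a single base seed, so the
# bootstrap samples are all different but the whole ensemble stays reproducible.
n_samples, n_models, base_seed = 100, 5, 42
bootstrap_indices = [
    resample(np.arange(n_samples), replace=True, random_state=base_seed + i)
    for i in range(n_models)
]
# Each entry of bootstrap_indices now differs from the others.
```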
Thanks for noticing this! It looks like this random state has long overstayed its welcome and has perhaps been reducing model performance for a few years (!). I'll raise a separate issue and open a PR fixing this.
@ml-evs It's indeed not the intended behaviour, but note that the GA never really uses bootstrapping; it is only used when fitting an EnsembleMODNetModel from scratch, so we never ran into this issue.
I would turn off bootstrapping altogether by default; just having different initial weights is mostly enough, or even better.