
Ignore 10% worst replicas in hyper loss

Open APJansen opened this issue 1 year ago • 3 comments

Basic implementation, not tested yet and the percentage is not configurable from the runcard.

APJansen avatar Mar 20 '24 10:03 APJansen

I have implemented what we discussed by removing passed at the point where it was created (unless there's no hyperopt; I left that case as is). Instead, I've set passed based purely on whether the hyper loss exceeds the threshold: if more than 10% of the replicas have infinite loss, the threshold will automatically be violated.

I'm a bit confused now about whether that's what we want, since this check already exists of course. The difference is that the one I added does not include the penalties and sets the hyperopt status to failed, whereas the existing check only adds a penalty.
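The pass/fail logic described above could be sketched roughly as follows (a minimal illustration, not the actual n3fit code; the function name and the mean reduction are assumptions):

```python
import numpy as np

def trial_passed(replica_losses, hyper_threshold):
    """Hypothetical sketch: decide whether a hyperopt trial passes.

    Because infinity propagates through the mean, any diverged
    (infinite-loss) replica pushes the reduced hyper loss above any
    finite threshold, so the trial fails automatically.
    """
    hyper_loss = np.mean(replica_losses)  # reduction over replicas
    return bool(hyper_loss <= hyper_threshold)
```

For example, `trial_passed([1.2, 1.5, np.inf, 1.3], 5.0)` fails because the mean is infinite, while `trial_passed([1.2, 1.5, 1.4, 1.3], 5.0)` passes.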

To make the tests pass I had to increase the hyper threshold, since a trial now fails outright when it is violated. Can you check that this doesn't break other tests, @Cmurilochem?

Also, it's running on the GPU again, without lhapdf and conda, see here for the setup and slurm scripts.

APJansen avatar Apr 03 '24 13:04 APJansen


@APJansen. I do not expect the increased hyper threshold to break the other tests, but they may take longer, since you would in principle be fitting more folds than before. However, I am also a bit confused about the reasoning behind this.

If I understood correctly, in the past (I am not sure it still holds) passed/not passed was tied to threshold_chi2, which was passed as an argument when instantiating the Stopping object; here. This is used in its monitor_chi2 method, which is in turn used in callback.py. But there was also hyper_threshold, which was compared with the final (statistic) hyper_loss of each fold (including penalties) and was responsible for skipping a hyperparameter combination. I see now that your passed is tied directly to self.hyper_threshold. I am fairly sure that is your intention, but even so this discussion might help you in some way. Please also have a look at monitor_chi2; I see it tries to deal with some NaNs.
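The NaN handling mentioned for monitor_chi2 could look roughly like this (a minimal sketch, not the actual n3fit Stopping class; only the names threshold_chi2 and monitor_chi2 come from the discussion above, the logic is assumed):

```python
import math

class Stopping:
    """Sketch of a chi2 monitor that refuses to pass on NaN losses."""

    def __init__(self, threshold_chi2):
        self.threshold_chi2 = threshold_chi2
        self.passed = False

    def monitor_chi2(self, chi2):
        # A NaN chi2 means the fit diverged: never mark it as passed
        if math.isnan(chi2):
            self.passed = False
            return False
        self.passed = chi2 < self.threshold_chi2
        return self.passed
```

The point of the explicit isnan check is that `nan < threshold` is always False in Python, but an explicit branch makes the divergence case visible and auditable.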

Cmurilochem avatar Apr 05 '24 07:04 Cmurilochem

Hi @APJansen and @goord, as I discussed with you before, I ran a hyperopt experiment and realized that most of the trials finish with {"status": "fail"} and {"loss": NaN}. For instance, out of 85 trials, only one seems to have a loss lower than 10. Just in case, I attach my partial trial.json file below: tries.json

After looking at the renew_hyperopt.yml runcard, I noticed that I should have added average_best as my replica_statistic. I had overlooked it because I copy this runcard directly from the repo to my local directory before running the experiment on snellius.

  1. @APJansen: to avoid confusion in the future, could we add average_best in renew_hyperopt.yml?: https://github.com/NNPDF/nnpdf/blob/47a6898bd53bff165ca23f647f01bbf2938f2da7/n3fit/runcards/hyperopt_studies/renew_hyperopt.yml#L164 I could do that if you agree.

  2. Turning back to the results of the experiments I ran (unknowingly) with average as replica_statistic: I still cannot quite understand why so many trials are discarded. Our passed criterion seems to be much more restrictive than before. Looking at the code, I see https://github.com/NNPDF/nnpdf/blob/47a6898bd53bff165ca23f647f01bbf2938f2da7/n3fit/src/n3fit/model_trainer.py#L996 and, right after, https://github.com/NNPDF/nnpdf/blob/47a6898bd53bff165ca23f647f01bbf2938f2da7/n3fit/src/n3fit/model_trainer.py#L1038 Both fold_loss and hyper_loss appear to represent hyper losses reduced over replicas for a specific fold.
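The difference between average and average_best as a replica statistic can be sketched like this (a hypothetical illustration of the "ignore the 10% worst replicas" idea in the PR title; the function name, signature, and default proportion are assumptions, not the exact n3fit implementation):

```python
import numpy as np

def average_best(replica_losses, proportion=0.9):
    """Average only the best `proportion` of replicas,
    discarding the worst 10% by default."""
    losses = np.sort(np.asarray(replica_losses, dtype=float))
    n_keep = max(1, int(np.floor(proportion * losses.size)))
    return float(np.mean(losses[:n_keep]))

losses = [1.0, 1.1, 1.2, 1.3, 1.1, 1.0, 1.2, 1.1, 1.3, np.inf]
# A plain average is dragged to infinity by the single diverged
# replica, while average_best drops it and stays finite.
```

This would explain why trials evaluated with the plain average are discarded so often: a single infinite-loss replica is enough to fail the trial, whereas average_best tolerates up to 10% of diverged replicas before the statistic itself becomes infinite.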

Cmurilochem avatar Apr 19 '24 07:04 Cmurilochem

I'll rebase this on top of master to facilitate the review (it shouldn't make a big difference since the last rebase was not long ago)

scarlehoff avatar May 29 '24 09:05 scarlehoff

@Cmurilochem I've added a runcard option penalties_in_loss.

I'll add this to the docs. I decided not to add the proportion to the runcard because it adds some complexity (I would need a custom parser in validphys) and, at this stage of the project, it seemed better to skip it.

(However, since I've already written the code, if you would like access to the proportion or other arguments for the statistics, let me know and I'll push it.)
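The effect of a flag like penalties_in_loss could be sketched as follows (assumed logic, only the option name comes from the comment above; the function and its signature are hypothetical):

```python
def compute_hyper_loss(fold_losses, penalties, penalties_in_loss=False):
    """Sketch: combine the average fold loss with penalties only
    when the runcard flag `penalties_in_loss` is enabled."""
    loss = sum(fold_losses) / len(fold_losses)
    if penalties_in_loss:
        loss += sum(penalties)
    return loss
```

With penalties excluded (the default here), a trial is judged on the raw fold losses alone; enabling the flag folds the penalty terms into the quantity compared against the hyper threshold.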

scarlehoff avatar Jun 02 '24 10:06 scarlehoff


Thanks @scarlehoff. It looks clear. Thanks for your help.

Cmurilochem avatar Jun 03 '24 07:06 Cmurilochem

If @Cmurilochem confirms this PR works for him as it is, I think it can be merged.

Hi @scarlehoff. Thanks for all your hard work. It looks great to me and is working as expected. I just added a final commit updating our experiment runcards to set a non-default penalties_in_loss for reproducibility. Please let me know if you agree with that. For the moment I will take the lead and approve your changes.

edit: oh, I did not see you have done so already; please merge it when you are ready.

Cmurilochem avatar Jun 04 '24 14:06 Cmurilochem