Errors running in single precision mode with non-Asimov datasets
Running hypo_testing.py for NMO studies and getting errors with parameters going to NaN when running in single precision. I modified the error message so it would tell me which parameter, and it seems it's happening with the muon background:
ValueError: Param atm_muon_scale has a value nan dimensionless which is not in the range of (<Quantity(0.01, 'dimensionless')>, <Quantity(1.0, 'dimensionless')>)
This could be a problem with non-Asimov datasets or, given that it's in the muon background, it could be something in the icc script. I set off 4 jobs in single precision, each to run 50 trials, and they all failed after some number of trials (< 10, but I think that's fairly irrelevant). All trials done in double precision are finishing fine.
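For reference, the modification mentioned above was just to include the parameter's name in the out-of-range error so you can see which one went to NaN. A minimal sketch of that kind of check (the function and argument names are illustrative, not PISA's actual Param API):

# Illustrative only -- not PISA's actual Param API. The point is to name
# the offending parameter in the error so NaNs can be traced back.
def check_param_range(name, value, vmin, vmax, units='dimensionless'):
    # NaN fails every comparison, so a NaN value also triggers the error
    if not (vmin <= value <= vmax):
        raise ValueError(
            'Param %s has a value %s %s which is not in the range of (%s, %s)'
            % (name, value, units, vmin, vmax)
        )

Calling check_param_range('atm_muon_scale', float('nan'), 0.01, 1.0) produces a message like the one quoted above.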
You said you've gone back and forth with @philippeller on this, but could you post exactly how to reproduce this issue (which configs, which data files, and the command line used)? I think a first step is having Philipp try to replicate the issue on our workstation that has a K40, running exactly as you are. Something you can try in the meantime is running with KDE turned on.
I wonder if it could have to do with the array datatypes. When I run KDEs I get back single-precision arrays; for the histogrammed ICC it is not explicitly stated. So maybe @swren could you try adding a .astype(FTYPE) to the histogramdd outputs in the ICC stage and see if that helps?
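Something along these lines is what I have in mind (sketch only; FTYPE here stands in for whatever float type the rest of the pipeline is configured to use, and the inputs are placeholders rather than the actual ICC events):

import numpy as np

# Stand-in for the pipeline's compiled-in float type: np.float32 in
# single-precision mode, np.float64 otherwise.
FTYPE = np.float32

# Placeholder inputs; in the ICC stage these would be the muon events
# and the analysis binning.
sample = np.random.RandomState(0).rand(1000, 3)
bin_edges = [np.linspace(0.0, 1.0, 11)] * 3

# np.histogramdd returns float64 counts here, so cast explicitly before
# they get mixed with single-precision arrays downstream.
hist, edges = np.histogramdd(sample, bins=bin_edges)
hist = hist.astype(FTYPE)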
Currently running trials with @philippeller's potential solution above. 90 minutes later they're still running but I'll keep an eye on it throughout the day. If that doesn't fix the problem then I'll report back what I did and how to reproduce.
Four hours later and everything is still hunky-dory. I will commit this change into my GRID branch and run some more extensive jobs on the GRID before fully signing it off, if you like.
Sounds like it's fixed, but I'm not going to get to much merging work today anyway, so commit the change, do whatever testing you'd like, and just let us know when you're confident about the fix. Since that branch is a bit different from cake, you'll probably need another branch to merge the same changes into cake. (But that should be easy, since the changes you've made, besides getting rid of files, aren't too many.)
OK, now I'm confused... This fix is almost certainly working in my branch when I run on the interactive machine. Both jobs I've set off here have run until they get killed for running too long. But I'm still getting the same error message when I run on the remote cluster...
Yeah I'm pretty sure it was a coincidence that it seemed to work. I'll tell you guys how to reproduce on Monday.
So, about reproducing this (sorry it's a day late...). I use my cake_GRID branch to run these jobs and I'm running hypo_testing.py with the DRAGON sample. I run:
export SETTINGS=$PISA/pisa/resources/settings
export MINIMIZER=$SETTINGS/minimizer/bfgs_settings_fac1e5_eps1e-8_mi100.json
export H0PIPELINE1=$SETTINGS/pipeline/nmo_dragon_mc_nufit22_mc.cfg
export H0PIPELINE2=$SETTINGS/pipeline/nmo_dragon_mc_nufit22_icc.cfg
python $PISA/pisa/analysis/hypo_testing.py --logdir $OUTPUTDIRECTORY --minimizer-settings $MINIMIZER --data-is-mc --h0-pipeline $H0PIPELINE1 --h0-pipeline $H0PIPELINE2 --h0-param-selections nh --h0-name NO --h1-param-selections ih --h1-name IO --fluctuate-fid --metric mod_chi2 --num-fid-trials 50 --fid-start-ind 251
where the pipelines make reference to a $DRAGONDIRECTORY where the events file is located, as well as my personal discrete systematics files, which I have uploaded to Dropbox here:
https://www.dropbox.com/s/pduo3afq62djsva/dom_eff_sysfits_deepcore_gauss.json?dl=0
https://www.dropbox.com/s/qs7wgwxzpy63qt1/hole_ice_fwd_sysfits_deepcore_gauss.json?dl=0
https://www.dropbox.com/s/4fjn2dlgv3kk7u1/hole_ice_sysfits_deepcore_gauss.json?dl=0
$OUTPUTDIRECTORY can obviously be wherever you like...
So now that I've actually come to process my output, I'm seeing another issue in single precision mode. I had some trials where I thought the minimiser had obviously failed, so I put together a function to remove "outliers" from the chi2 distributions, since those trials had seemingly stopped early. I had this problem for true NO but not for true IO, and the only difference was that I had tried to run the true NO jobs in fp32. Sure enough, when I deleted those pseudo-experiments from my output folder, the outlying trials all disappeared...
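To illustrate what I mean by removing "outliers", it was something along the lines of the following cut on the fiducial-trial chi2 values (a sketch, not the exact function I used):

import numpy as np

# Sketch only: flag fiducial trials whose chi2 lies far outside the bulk
# of the distribution, using a median-absolute-deviation criterion.
def flag_outlier_trials(chi2_values, n_mad=5.0):
    chi2_values = np.asarray(chi2_values, dtype=float)
    median = np.median(chi2_values)
    mad = np.median(np.abs(chi2_values - median))
    if mad == 0.0:
        return np.zeros(chi2_values.shape, dtype=bool)
    return np.abs(chi2_values - median) > n_mad * mad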
I wonder if this last issue you see is a matter of minimizer settings needing to be different due to some degree of roughness introduced by lower precision. I think @philippeller (please correct me if I'm wrong, Philipp) had been using SLSQP for his minimizer, which might be more robust to this (or it just happens that the settings he had chosen are more robust, and maybe the same thing can be achieved with L-BFGS-B with tweaks).
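To make the comparison concrete, both SLSQP and L-BFGS-B are available through scipy.optimize.minimize, and the knobs I have in mind are just options like eps and ftol. A toy sketch (the objective and option values are illustrative only, not our actual settings):

from scipy.optimize import minimize
import numpy as np

# Toy objective standing in for the metric surface, which is rougher
# in single precision.
def objective(x):
    return float(np.sum((x - 1.0) ** 2))

x0 = np.zeros(3)

# L-BFGS-B: gradient-based; a rough surface may call for a larger
# finite-difference step (eps) and a looser ftol.
res_lbfgsb = minimize(objective, x0, method='L-BFGS-B',
                      options={'eps': 1e-4, 'ftol': 1e-5, 'maxiter': 100})

# SLSQP for comparison, with a correspondingly loose ftol.
res_slsqp = minimize(objective, x0, method='SLSQP',
                     options={'ftol': 1e-5, 'maxiter': 100})

print(res_lbfgsb.x, res_slsqp.x)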
Due to all of our minimizer settings issues, I think it'd be worthwhile to start a new page on the wiki documenting our trials and tribulations with minimizers / settings for our analyses, so that there can be a central "lab notebook" collecting the little bits of wisdom each of us has had with all of our different configurations...
Yeah you're probably right that that's unrelated to the issues above, I just wanted to bring it up somewhere and this seemed like the best place! 😄
Added a page to the wiki here: https://github.com/jllanfranchi/pisa/wiki/minimizers_and_settings
Possibly same issue here: https://github.com/scipy/scipy/issues/4873
I can't tell whether this issue still persists.
Let's close this as "not planned", because there is no way to reproduce the original issue with the execution of hypo_testing.py. There doesn't seem to be anything we can do about the possible external minimiser issue raised above (apart from reintroducing user documentation somewhere).