Errors running in single precision mode with non-Asimov datasets

Open steven-j-wren opened this issue 8 years ago • 14 comments

Running hypo_testing.py for NMO studies and getting errors with parameters going to NaN when running in single precision. I modified the error message so it would tell me which parameter and it seems it's happening with the muon background:

ValueError: Param atm_muon_scale has a value nan dimensionless which is not in the range of (<Quantity(0.01, 'dimensionless')>, <Quantity(1.0, 'dimensionless')>)

This could be a problem with non-Asimov datasets or, given that it's in the muon background, it could be something in the icc script. I set off 4 jobs in single precision, each to run 50 trials, and they all failed after some number of trials (< 10, but I think that's fairly irrelevant). All trials done in double precision are finishing fine.
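The range check reports the failure because NaN compares false against any bound, so the message flags a range violation without saying where the NaN came from. A minimal sketch of that behaviour, using a hypothetical `in_range` helper (an illustration only, not PISA's actual validation code) and the bounds from the error message:

```python
import math

def in_range(value, lower, upper):
    """Return True only if lower <= value <= upper; NaN always fails."""
    return lower <= value <= upper

# Bounds copied from the error message; in_range itself is hypothetical
LOWER, UPPER = 0.01, 1.0

ok = in_range(0.5, LOWER, UPPER)
bad = in_range(float("nan"), LOWER, UPPER)

print(ok)                        # True
print(bad)                       # False: every comparison with NaN is False
print(math.isnan(float("nan")))  # True: an explicit NaN test
```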

steven-j-wren avatar Nov 22 '16 21:11 steven-j-wren

You said you've gone back and forth with @philippeller on this, but could you post exactly how to reproduce this issue? (Which configs, which data files, and command line used.) I think a first step is having Philipp try to replicate the issue on our workstation that has a K40, running exactly as you are. Something you can try in the meantime is running with KDE turned on.

jllanfranchi avatar Nov 23 '16 17:11 jllanfranchi

I wonder if it could have to do with the array datatypes. When I run KDEs I get back single precision arrays; for histogrammed ICC the datatype is not explicitly stated. So maybe @swren could you try adding a .astype(FTYPE) to the histogramdd outputs in the ICC stage and see if that helps?
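For illustration, a minimal sketch of that cast, assuming the histogramdd in question is NumPy's and that FTYPE is np.float32 in single precision mode (both assumptions; the sample data and binning here are made up):

```python
import numpy as np

# Assumption: stands in for the global float type in single precision mode
FTYPE = np.float32

rng = np.random.default_rng(0)
# Made-up 2D sample, stored in single precision
sample = rng.uniform(0.0, 1.0, size=(1000, 2)).astype(FTYPE)

# np.histogramdd returns float64 counts even for float32 input
hist, edges = np.histogramdd(sample, bins=(4, 5), range=[(0, 1), (0, 1)])
print(hist.dtype)       # float64

# Explicit cast so downstream arithmetic stays in FTYPE
hist_cast = hist.astype(FTYPE)
print(hist_cast.dtype)  # float32
```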

philippeller avatar Nov 23 '16 18:11 philippeller

Currently running trials with @philippeller's potential solution above. 90 minutes later they're still running but I'll keep an eye on it throughout the day. If that doesn't fix the problem then I'll report back what I did and how to reproduce.

steven-j-wren avatar Nov 24 '16 11:11 steven-j-wren

Four hours later and everything is still hunky dory. I will commit this change in to my GRID branch and run some more extensive jobs on the GRID before fully signing it off if you like.

steven-j-wren avatar Nov 24 '16 14:11 steven-j-wren

Sounds like it's fixed, but I'm not going to get to much merging work today anyway, so commit the change, do whatever testing you'd like, and just let us know when you're confident about the fix. Since that branch is a bit different from cake, we'll probably need another branch to merge the same changes into cake. (But that should be easy, since you haven't made many changes besides getting rid of files.)

jllanfranchi avatar Nov 24 '16 14:11 jllanfranchi

OK, now I'm confused... This fix is almost certainly working in my branch when I run on the interactive machine. Both jobs I've set off here have run until they kill themselves for running too long. But I'm still getting the same error message when I run on the remote cluster...

steven-j-wren avatar Nov 25 '16 10:11 steven-j-wren

Yeah I'm pretty sure it was a coincidence that it seemed to work. I'll tell you guys how to reproduce on Monday.

steven-j-wren avatar Nov 25 '16 22:11 steven-j-wren

So, about reproducing this (sorry it's a day late...). I use my cake_GRID branch to run these jobs and I'm running hypo_testing.py with the DRAGON sample. I run:

export SETTINGS=$PISA/pisa/resources/settings
export MINIMIZER=$SETTINGS/minimizer/bfgs_settings_fac1e5_eps1e-8_mi100.json
export H0PIPELINE1=$SETTINGS/pipeline/nmo_dragon_mc_nufit22_mc.cfg
export H0PIPELINE2=$SETTINGS/pipeline/nmo_dragon_mc_nufit22_icc.cfg

python $PISA/pisa/analysis/hypo_testing.py --logdir $OUTPUTDIRECTORY --minimizer-settings $MINIMIZER --data-is-mc --h0-pipeline $H0PIPELINE1 --h0-pipeline $H0PIPELINE2 --h0-param-selections nh --h0-name NO --h1-param-selections ih --h1-name IO --fluctuate-fid --metric mod_chi2 --num-fid-trials 50 --fid-start-ind 251

where the pipelines make reference to a $DRAGONDIRECTORY where the events file is located as well as my personal discrete systematics files, which I have uploaded to dropbox here:

https://www.dropbox.com/s/pduo3afq62djsva/dom_eff_sysfits_deepcore_gauss.json?dl=0
https://www.dropbox.com/s/qs7wgwxzpy63qt1/hole_ice_fwd_sysfits_deepcore_gauss.json?dl=0
https://www.dropbox.com/s/4fjn2dlgv3kk7u1/hole_ice_sysfits_deepcore_gauss.json?dl=0

$OUTPUTDIRECTORY can obviously be wherever you like...

steven-j-wren avatar Nov 29 '16 19:11 steven-j-wren

So now that I've actually come to process my output, I'm seeing another issue in single precision mode. I had some trials where I thought the minimiser had obviously failed, so I researched a function to remove "outliers" in the chi2 distributions, since these were trials which had seemingly stopped early. I had this problem for true NO but not for true IO, and the only difference was that I had tried to run the true NO jobs in fp32. Sure enough, when I deleted those fp32 pseudo-experiments from my output folder, the outlying trials all disappeared...
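As a rough sketch of what such an outlier filter might look like, here is one based on the median absolute deviation. The `remove_outliers` helper, its threshold and the chi2 values are all hypothetical, not the function actually used for these results:

```python
import numpy as np

def remove_outliers(values, n_mad=5.0):
    """Keep entries within n_mad scaled median absolute deviations of the median."""
    values = np.asarray(values, dtype=np.float64)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        return values  # no spread to judge against, so keep everything
    # 1.4826 scales the MAD to match a standard deviation for Gaussian data
    keep = np.abs(values - median) <= n_mad * 1.4826 * mad
    return values[keep]

# Made-up chi2 values; the last one mimics a fit that stopped early
chi2 = [101.2, 99.8, 100.5, 98.9, 100.1, 512.0]
cleaned = remove_outliers(chi2)
print(cleaned)
```

A filter like this hides the symptom only; as noted above, the outliers here traced back to the fp32 trials.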

steven-j-wren avatar Dec 18 '16 13:12 steven-j-wren

I wonder if this last issue you see is a matter of minimizer settings needing to be different due to some degree of roughness introduced by lower precision. I think @philippeller (please correct me if I'm wrong, Philipp) had been using SLSQP for his minimizer, which might be more robust to this (or it just happens that the settings he had chosen are more robust, and maybe the same thing can be achieved with L-BFGS-B with tweaks).
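One way to picture that roughness is a toy objective evaluated entirely in single precision, with a forward finite difference taken at a step of 1e-8. This assumes the eps1e-8 in the settings file name refers to such a step, which is a guess since the file contents are not shown here. The functions below are hypothetical and do not reproduce what PISA or the minimizer actually does:

```python
import numpy as np

def objective(x, dtype):
    """Toy quadratic evaluated entirely in the given floating-point precision."""
    x = dtype(x)
    return (x - dtype(0.3)) ** 2

def forward_difference(x, eps, dtype):
    """Forward finite-difference slope of the toy objective at x."""
    x = dtype(x)
    step = dtype(eps)
    return (objective(x + step, dtype) - objective(x, dtype)) / step

eps = 1e-8  # assumed step size, same order as "eps1e-8" in the settings file name

slope32 = forward_difference(1.0, eps, np.float32)
slope64 = forward_difference(1.0, eps, np.float64)

print(slope32)  # 0.0: in float32, 1.0 + 1e-8 rounds back to 1.0
print(slope64)  # close to the analytic slope 2 * (1.0 - 0.3) = 1.4
```

In float32 the spacing between representable numbers near 1.0 is about 1.2e-7, so a 1e-8 step is lost and the estimated slope vanishes. This is one plausible way a gradient-based minimizer could stop early, not a confirmed diagnosis of the trials above.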

Due to all of our minimizer settings issues, I think it'd be worthwhile to start a new page on the wiki documenting our trials and tribulations with minimizers / settings for our analyses, so that there can be a central "lab notebook" collecting the little bits of wisdom each of us has had with all of our different configurations...

jllanfranchi avatar Dec 18 '16 14:12 jllanfranchi

Yeah you're probably right that that's unrelated to the issues above, I just wanted to bring it up somewhere and this seemed like the best place! 😄

steven-j-wren avatar Dec 18 '16 14:12 steven-j-wren

Added a page to the wiki here: https://github.com/jllanfranchi/pisa/wiki/minimizers_and_settings

jllanfranchi avatar Dec 18 '16 14:12 jllanfranchi

Possibly same issue here: https://github.com/scipy/scipy/issues/4873

jllanfranchi avatar Apr 03 '17 19:04 jllanfranchi

I can't see whether this issue persists.

LeanderFischer avatar May 07 '24 15:05 LeanderFischer

Let's close as "not planned" because there is no way to reproduce the original issue with the execution of hypo_testing.py. There doesn't seem to be anything we can do about the possible external minimiser issue raised above (apart from reintroducing user documentation somewhere).

thehrh avatar Jul 31 '24 08:07 thehrh