
fermitools on cluster w. slurm can't use as many CPUs as they should be able to


I installed the fermitools and ftools on NASA's NCCS discover cluster. On this machine a user can use up to 6300 CPUs via up to 25 jobs using slurm. However, I find that executing more than roughly 25 scripts from an individual job causes a variety of problems. This means I'm limited to ~25 x 25 = 625 CPUs, i.e. an order of magnitude below the limit. Since I'm looking at all cataloged sources, my code naturally parallelizes into 6658 scripts for 4FGL DR3.

My slurm scripts are of the form:

discover32[1003] more slurm_script1.sh

#!/usr/bin/csh
#To run this type: sbatch slurm_script1.sh
#SBATCH --time=11:59:00
#SBATCH -o slurm_job_1.out
#SBATCH -e slurm_job_1.err
#SBATCH --ntasks-per-node=20
#SBATCH --ntasks=101
source ~rcorbet/.cshrc
ftools
conda activate fermi
cd /gpfsm/dnb31/rcorbet/Fermi/Prod
bex2020nd -batch -debug -file bex1.par >& klog1.txt &
.
.
.
bex2020nd -batch -debug -file bex100.par >& klog100.txt &
wait

The bex2020nd scripts call gtselect, gtmktime, gtsrcprob, gtbin, gtexposure, gtbary and fdump.
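
To give a rough idea of what each bex2020nd script does, here is a simplified sketch of a per-source chain through some of those tools; the file names, coordinates, times, and binning below are just placeholders for illustration, not my actual parameters:

#!/usr/bin/csh
# Illustrative sketch only, not the real bex2020nd script: one per-source pass through
# some of the tools listed above, with made-up file names, coordinates, times and binning.
set src    = 4FGLJ0000.0p0000
set evfile = lat_photon_merged.fits
set scfile = lat_spacecraft_merged.fits
gtselect infile=$evfile outfile=${src}_sel.fits ra=0.0 dec=0.0 rad=3 \
    tmin=239557417 tmax=626835668 emin=100 emax=300000 zmax=90
gtmktime scfile=$scfile filter="(DATA_QUAL>0) && (LAT_CONFIG==1)" roicut=n \
    evfile=${src}_sel.fits outfile=${src}_gti.fits
gtbin algorithm=LC evfile=${src}_gti.fits outfile=lc_${src}.fits scfile=$scfile \
    tbinalg=LIN tstart=239557417 tstop=626835668 dtime=86400
gtexposure infile=lc_${src}.fits scfile=$scfile irfs=CALDB \
    srcmdl=${src}_model.xml target=_${src}
# gtsrcprob, gtbary and fdump follow in the same style.

Note that every parallel instance reads the same lat_spacecraft_merged.fits via scfile, which is the file the errors below complain about.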

When running e.g. 100 bex2020nd scripts from a single slurm_script1.sh, I find that the execution speed of each bex2020nd script is very slow. In addition, I get rather unpredictable errors such as:

klog16.txt:
gtexposure chatter=1 infile="lc_4FGLJ0001.2-0747.fits" scfile="lat_spacecraft_merged.fits" irfs="CALDB" srcmdl="tmp_77784FGLJ0001.2-0747.bex_LATxmlmodel.xml" target=_4FGLJ0001.2-0747
Caught N3tip12TipExceptionE at the top level: Could not open FITS extension "lat_spacecraft_merged.fits[SC_DATA][col START;STOP;LIVETIME;RA_SCZ;DEC_SCZ;RA_SCX;DEC_SCX][(START >= 239557417.5) && (STOP <= 626835668)]" (CFITSIO ERROR 113: could not allocate memory)

klog35.txt:
gtmktime chatter=1 scfile="lat_spacecraft_merged.fits" filter="(DATA_QUAL>0) && ABS(ROCK_ANGLE)<90 && (LAT_CONFIG==1) && (angsep(RA_ZENITH,DEC_ZENITH,9.9860001E-01,-1.1825100E+01)+3<105) && (angsep(9.9860001E-01,-1.1825100E+01,RA_SUN,DEC_SUN)>5+3) && (angsep(9.9860001E-01,-1.1825100E+01,RA_SCZ,DEC_SCZ)<180)" roicut=n evfile="tmp_77974FGLJ0003.9-1149.temp2.fits" outfile="tmp_77974FGLJ0003.9-1149.temp3.fits"
Caught N3tip12TipExceptionE at the top level: Could not open FITS extension "lat_spacecraft_merged.fits[SC_DATA][(DATA_QUAL>0) && ABS(ROCK_ANGLE)<90 && (LAT_CONFIG==1) && (angsep(RA_ZENITH,DEC_ZENITH,9.9860001E-01,-1.1825100E+01)+3<105) && (angsep(9.9860001E-01,-1.1825100E+01,RA_SUN,DEC_SUN)>5+3) && (angsep(9.9860001E-01,-1.1825100E+01,RA_SCZ,DEC_SCZ)<180)]" (CFITSIO ERROR 106: error writing to FITS file)

klog48.txt:
gtmktime chatter=1 scfile="lat_spacecraft_merged.fits" filter="(DATA_QUAL>0) && ABS(ROCK_ANGLE)<90 && (LAT_CONFIG==1) && (angsep(RA_ZENITH,DEC_ZENITH,1.7680000E+00,7.3051201E+01)+3<105) && (angsep(1.7680000E+00,7.3051201E+01,RA_SUN,DEC_SUN)>5+3) && (angsep(1.7680000E+00,7.3051201E+01,RA_SCZ,DEC_SCZ)<180)" roicut=n evfile="tmp_78104FGLJ0007.0p7303.temp2.fits" outfile="tmp_78104FGLJ0007.0p7303.temp3.fits"
Caught N3tip12TipExceptionE at the top level: Could not open FITS extension "lat_spacecraft_merged.fits[SC_DATA][(DATA_QUAL>0) && ABS(ROCK_ANGLE)<90 && (LAT_CONFIG==1) && (angsep(RA_ZENITH,DEC_ZENITH,1.7680000E+00,7.3051201E+01)+3<105) && (angsep(1.7680000E+00,7.3051201E+01,RA_SUN,DEC_SUN)>5+3) && (angsep(1.7680000E+00,7.3051201E+01,RA_SCZ,DEC_SCZ)<180)]" (CFITSIO ERROR 106: error writing to FITS file)

I also find that the top script sometimes restarts and runs all the bex2020nd scripts again right from the beginning, creating new sets of output files. I can tell this because the bex2020nd scripts create temporary files that include both the PID and the source name.

If, instead of running 100 bex2020nd scripts from a single slurm_script1.sh, I run 5 slurm_scriptn.sh scripts, each of which runs 20 bex2020nd scripts, I have no problems, even though I'm running twice as many bex2020nd scripts, and things run much faster. Note that the slurm_script1.sh script specifies 20 tasks per node, so I don't think it's a node itself that is getting overloaded. Also note that the bex2020nd scripts aren't calling gtdiffrsp, which I previously found can lock up a machine if more than ~25 instances are called. (I'm using the weekly files with the diffuse columns already added.)
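
For what it's worth, the 5 x 20 split can be generated mechanically rather than by hand. The helper below is only a rough sketch: the walltime, output names, and working directory are copied from the example script above, and the rest is illustrative, not my actual workflow.

#!/usr/bin/csh
# Rough sketch of a generator for the 5 x 20 split described above.
# Writes slurm_script1.sh ... slurm_script5.sh, each launching 20 bex2020nd runs.
set njobs  = 5
set perjob = 20
set k = 1
foreach j (`seq 1 $njobs`)
    set out = slurm_script${j}.sh
    @ ntasks = $perjob + 1   # the original example used ntasks = number of runs + 1
    echo '#!/usr/bin/csh'                      >  $out
    echo '#SBATCH --time=11:59:00'             >> $out
    echo "#SBATCH -o slurm_job_${j}.out"       >> $out
    echo "#SBATCH -e slurm_job_${j}.err"       >> $out
    echo '#SBATCH --ntasks-per-node=20'        >> $out
    echo "#SBATCH --ntasks=${ntasks}"          >> $out
    echo 'source ~rcorbet/.cshrc'              >> $out
    echo 'ftools'                              >> $out
    echo 'conda activate fermi'                >> $out
    echo 'cd /gpfsm/dnb31/rcorbet/Fermi/Prod'  >> $out
    foreach i (`seq 1 $perjob`)
        echo "bex2020nd -batch -debug -file bex${k}.par >& klog${k}.txt &" >> $out
        @ k = $k + 1
    end
    echo 'wait'                                >> $out
end

Each generated file is then submitted separately with sbatch, keeping each job at ~20 background bex2020nd processes.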

It would be great to be able to fully exploit the number of CPUs I should be able to use on discover!

robincorbet · Dec 15 '22 16:12