Number of scale variations found (0) is invalid.
The processing of high-priority requests is facing 5-10% hard failures in the NANO step, due to an infamous issue with the number of gen weight variations (https://github.com/cms-sw/cmssw/pull/46573)
- https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00510__v1_T_251110_220614_2021
- https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00511__v1_T_251208_220916_1559
- https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00336__v1_T_251110_220711_7956
with errors like:
%MSG
----- Begin Fatal Exception 30-Nov-2025 10:14:14 CET-----------------------
An exception of category 'LogicError' occurred while
[0] Processing global begin Run run: 1
[1] Calling method for module GenWeightsTableProducer/'genWeightsTable'
Exception Message:
Number of scale variations found (0) is invalid.
----- End Fatal Exception -------------------------------------------------
%MSG-w MemoryCheck: AfterModGlobalBeginRun 30-Nov-2025 10:14:14 CET Run: 1
The error report can be found here: https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00510__v1_T_251110_220614_2021
Job logs are under: https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00510__v1_T_251110_220614_2021/8001/GEN-RunIII2024Summer24wmLHEGS-00510_0/4d1eb528-7bd5-4f84-a7fa-7655d84c4da1-530-0-logArchive/job/WMTaskSpace/
This should be caught much earlier, in the GEN step, so that we do not waste resources going all the way to NANO only to fail miserably there.
The log of the corresponding step1 job is under: https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00510__v1_T_251110_220614_2021/8001/GEN-RunIII2024Summer24wmLHEGS-00510_0/4d1eb528-7bd5-4f84-a7fa-7655d84c4da1-530-0-logArchive/job/WMTaskSpace/cmsRun1/
assign generators
assign xpog
New categories assigned: generators
@lviliani,@mkirsano,@sensrcn,@theofil you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @vlimant.
@Dr15Jones, @ftenchini, @makortel, @mandrenguyen, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
New categories assigned: xpog
@battibass,@ftorrresd you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new instance of https://github.com/cms-sw/cmssw/issues/43784.
In the step1 log, I noticed:
INFO: load configuration from /srv/job/WMTaskSpace/cmsRun1/lheevent/process/Cards/amcatnlo_configuration.txt
Using default text editor "vi". Set another one in ./input/mg5_configuration.txt
No valid eps viewer found. Please set in ./input/mg5_configuration.txt
No valid web browser found. Please set in ./input/mg5_configuration.txt
process>INFO: Running Systematics computation
INFO: Idle: 1, Running: 3, Completed: 0 [ current time: 07h48 ]
WARNING: program /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_8/external/slc7_amd64_gcc10/bin/python3 -O /srv/job/WMTaskSpace/cmsRun1/lheevent/process/bin/internal/systematics.py events.lhe ./tmp_2_events.lhe --start_id=1001 --pdf=325300,316200,306000@0,322500@0,322700@0,322900@0,323100@0,323300@0,323500@0,323700@0,323900@0,305800,303200@0,292200@0,331300,331600,332100,332300@0,332500@0,332700@0,332900@0,333100@0,333300@0,333500@0,333700@0,14000,14066@0,14067@0,14069@0,14070@0,14100,14200@0,14300@0,27400,27500@0,27550@0,93300,61200,42780,315000@0,315200@0,262000@0,263000@0 --mur=1,2,0.5 --muf=1,2,0.5 --together=muf,mur --dyn=-1 --start_event=10218 --stop_event=15327 --result=./log_sys_2.txt --lhapdf_config=/cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/lhapdf/6.4.0-68defff11ffd434c73727d03802bfb85/share/LHAPDF/../../bin/lhapdf-config launch ends with non zero status: -9. Stop all computation
INFO: Idle: 0, Running: 3, Completed: 1 [ 53.2s ]
INFO: Running Systematics computation
which might indicate that something went wrong in the systematics weights computation (leading to 0 weights in NANO), but that it did not completely abort the job at that level, even though "Stop all computation" suggests it should have?
Thanks @vlimant, this actually helps. I think it means that the check introduced here: https://gitlab.cern.ch/cms-gen/genproductions_scripts/-/blob/master/bin/MadGraph5_aMCatNLO/runcmsgrid_NLO.sh#L149-152 is not catching all the possible failures of the systematics module.
Anyway, I guess the -9 exit code means that the program was killed with SIGKILL, and we are probably not catching these signals with the current approach.
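For reference, just to make the -9 concrete: Python's subprocess layer (used by MadGraph) reports a child killed by SIGKILL as returncode -9, while in a shell the same condition shows up as exit status 128+9=137. A minimal sketch, assuming a hypothetical wrapper around the systematics call (the invocation below is a simplified placeholder, not the actual runcmsgrid_NLO.sh code), of how the exit status could be inspected to tell a signal death apart from a normal failure:

```bash
#!/bin/bash
# Sketch only: the real systematics call and its arguments in runcmsgrid_NLO.sh differ.
python3 ./bin/internal/systematics.py events.lhe ./tmp_events.lhe
status=$?

if [ "$status" -ne 0 ]; then
  if [ "$status" -gt 128 ]; then
    # an exit status of 128+N means the child died from signal N (137 = 128+9 = SIGKILL)
    echo "systematics.py was killed by signal $((status - 128))" >&2
  else
    echo "systematics.py exited with status $status" >&2
  fi
  exit 1   # make the gridpack run fail instead of silently continuing
fi
```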
Is it clear, BTW, that the exit 10086 from https://gitlab.cern.ch/cms-gen/genproductions_scripts/-/blob/master/bin/MadGraph5_aMCatNLO/runcmsgrid_NLO.sh#L149-152 inside the gridpack will actually result in a non-zero exit of https://github.com/cms-sw/cmssw/blob/CMSSW_14_0_X/GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh ?
set -e in https://github.com/cms-sw/cmssw/blob/CMSSW_14_0_X/GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh should do it, no?
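A self-contained illustration of that set -e behaviour (a standalone sketch, not the actual CMSSW script): the nested script stands in for the runcmsgrid.sh shipped inside the gridpack, the outer one for run_generic_tarball_cvmfs.sh. One caveat: shell exit statuses are truncated modulo 256, so exit 10086 is seen by the caller as 102, which is still non-zero and therefore still fatal under set -e.

```bash
#!/bin/bash
# outer.sh -- stands in for run_generic_tarball_cvmfs.sh (illustration only)
set -e

cat > inner.sh <<'EOF'
#!/bin/bash
# inner.sh -- stands in for the runcmsgrid.sh inside the gridpack
exit 10086
EOF
chmod +x inner.sh

./inner.sh                     # set -e aborts outer.sh here, with status 102 (10086 mod 256)
echo "not reached: inner.sh returned non-zero"
```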
But I think the problem is that these jobs are getting killed with SIGKILL in production during the systematics step, maybe because of too large memory consumption when running a large chunk of events in the same job (but this is just an assumption).
This SIGKILL kills systematics.py, but we are not able to catch it in the runcmsgrid.sh script, which therefore continues and produces LHE events without the systematic weights.
Discussing with @DickyChant we identified two possible solutions:
- implement a way to catch the SIGKILL in runcmsgrid.sh, which seems to be not as straightforward as I was assuming;
- check at the end of runcmsgrid.sh whether the LHE events have the weights stored; if not, it means that the systematics module failed and we abort the job (see the sketch below).
The second one is probably the best option.
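A minimal sketch of what that second option could look like at the end of runcmsgrid.sh. The output file name and the XML tags tested for are assumptions (the systematics module normally writes an <initrwgt> header and per-event <rwgt> blocks into the LHE file), and the exit code just mirrors the one used by the existing check:

```bash
#!/bin/bash
# Hypothetical post-check: abort if the systematics step left no weights in the LHE output.
LHEFILE=cmsgrid_final.lhe   # assumed output file name

if ! grep -q "<initrwgt" "$LHEFILE" || ! grep -q "<rwgt" "$LHEFILE"; then
  echo "ERROR: no systematics weights found in $LHEFILE, aborting" >&2
  exit 10086
fi
```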
Addressing this in https://gitlab.cern.ch/cms-gen/genproductions_scripts/-/issues/9 (since the genproductions utilities have been migrated there).
NB: most failures of that sort happened at T2_US_Purdue, with much smaller fractions at T3_US_SDSC, T1_US_FNAL and T2_US_Caltech.
We are wasting an enormous amount of resources on those requests.
https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00334__v1_T_251208_220920_7409 has joined the dance.
Where can I find the logs for this one?
Is this also running at one of the sites you mentioned before?
I wondered the same thing, whether we just have one site or a few sites with a CVMFS issue.
And actually I am interested in how we collect the logs. For https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00510__v1_T_251110_220614_2021 I found that only 1 out of 6 logs survives until NANO.
The frequency of logs in Unified is biased and not telling; looking into WMStats shows the frequency and number of failed jobs.