
Number of scale variations found (0) is invalid.

vlimant opened this issue 2 weeks ago · 20 comments

the processing of high-priority requests is facing 5-10% hard failures in the NANO step, due to an infamous issue with the number of gen weight variations (https://github.com/cms-sw/cmssw/pull/46573)

  • https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00510__v1_T_251110_220614_2021
  • https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00511__v1_T_251208_220916_1559
  • https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00336__v1_T_251110_220711_7956

with errors like:

%MSG
----- Begin Fatal Exception 30-Nov-2025 10:14:14 CET-----------------------
An exception of category 'LogicError' occurred while
   [0] Processing global begin Run run: 1
   [1] Calling method for module GenWeightsTableProducer/'genWeightsTable'
Exception Message:
Number of scale variations found (0) is invalid.
----- End Fatal Exception -------------------------------------------------
%MSG-w MemoryCheck:  AfterModGlobalBeginRun 30-Nov-2025 10:14:14 CET Run: 1

error report can be found here : https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00510__v1_T_251110_220614_2021

job logs under : https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00510__v1_T_251110_220614_2021/8001/GEN-RunIII2024Summer24wmLHEGS-00510_0/4d1eb528-7bd5-4f84-a7fa-7655d84c4da1-530-0-logArchive/job/WMTaskSpace/

this should be caught much earlier, in the GEN step, so that we do not waste resources going all the way to NANO only to fail miserably there.

Log of the corresponding step1 job is under : https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00510__v1_T_251110_220614_2021/8001/GEN-RunIII2024Summer24wmLHEGS-00510_0/4d1eb528-7bd5-4f84-a7fa-7655d84c4da1-530-0-logArchive/job/WMTaskSpace/cmsRun1/

vlimant avatar Dec 10 '25 09:12 vlimant

assign generators

vlimant avatar Dec 10 '25 10:12 vlimant

assign xpog

vlimant avatar Dec 10 '25 10:12 vlimant

New categories assigned: generators

@lviliani,@mkirsano,@sensrcn,@theofil you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Dec 10 '25 10:12 cmsbuild

cms-bot internal usage

cmsbuild avatar Dec 10 '25 10:12 cmsbuild

A new Issue was created by @vlimant.

@Dr15Jones, @ftenchini, @makortel, @mandrenguyen, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Dec 10 '25 10:12 cmsbuild

New categories assigned: xpog

@battibass,@ftorrresd you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Dec 10 '25 10:12 cmsbuild

a new instance of https://github.com/cms-sw/cmssw/issues/43784

vlimant avatar Dec 10 '25 10:12 vlimant

in the step1 log, I noticed:

INFO: load configuration from /srv/job/WMTaskSpace/cmsRun1/lheevent/process/Cards/amcatnlo_configuration.txt
Using default text editor "vi". Set another one in ./input/mg5_configuration.txt
No valid eps viewer found. Please set in ./input/mg5_configuration.txt
No valid web browser found. Please set in ./input/mg5_configuration.txt
process>INFO: Running Systematics computation
INFO:  Idle: 1,  Running: 3,  Completed: 0 [ current time: 07h48 ]
WARNING: program /cvmfs/cms.cern.ch/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_8/external/slc7_amd64_gcc10/bin/python3 -O /srv/job/WMTaskSpace/cmsRun1/lheevent/process/bin/internal/systematics.py events.lhe ./tmp_2_events.lhe --start_id=1001 --pdf=325300,316200,306000@0,322500@0,322700@0,322900@0,323100@0,323300@0,323500@0,323700@0,323900@0,305800,303200@0,292200@0,331300,331600,332100,332300@0,332500@0,332700@0,332900@0,333100@0,333300@0,333500@0,333700@0,14000,14066@0,14067@0,14069@0,14070@0,14100,14200@0,14300@0,27400,27500@0,27550@0,93300,61200,42780,315000@0,315200@0,262000@0,263000@0 --mur=1,2,0.5 --muf=1,2,0.5 --together=muf,mur --dyn=-1 --start_event=10218 --stop_event=15327 --result=./log_sys_2.txt --lhapdf_config=/cvmfs/cms.cern.ch/slc7_amd64_gcc10/external/lhapdf/6.4.0-68defff11ffd434c73727d03802bfb85/share/LHAPDF/../../bin/lhapdf-config launch ends with non zero status: -9. Stop all computation
INFO:  Idle: 0,  Running: 3,  Completed: 1 [  53.2s  ]
INFO: Running Systematics computation

which might indicate that something went wrong in the systematics weights computation (leading to 0 weights in NANO), but that the failure did not abort the job at that level, even though the "Stop all computation" message suggests it should have?

vlimant avatar Dec 10 '25 10:12 vlimant

Thanks @vlimant, this helps actually. I think it means that the check that was introduced here: https://gitlab.cern.ch/cms-gen/genproductions_scripts/-/blob/master/bin/MadGraph5_aMCatNLO/runcmsgrid_NLO.sh#L149-152 is not catching all the possible failures of the systematics module.

lviliani avatar Dec 10 '25 10:12 lviliani

Anyways, I guess the -9 exit code means that the program was killed with a SIGKILL and we are probably not catching these signals with the current approach.
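For reference, a signal death is reported differently depending on the observer: assuming MadGraph's internal job driver uses Python's subprocess module (which would explain the -9 in the log), a child killed by signal N shows up there as returncode -N, while a POSIX shell sees it as 128+N, i.e. 137 for SIGKILL. A minimal sketch:

```shell
# A child process killed by SIGKILL, observed two ways.

# 1) From the shell: the wait status maps to 128 + 9 = 137.
sh -c 'kill -KILL $$'
echo "shell sees exit status: $?"          # prints 137

# 2) From Python's subprocess (presumably how MadGraph's driver
#    observes it): returncode is the negated signal number.
python3 - <<'EOF'
import subprocess
rc = subprocess.run(["sh", "-c", "kill -KILL $$"]).returncode
print("python sees returncode:", rc)       # prints -9
EOF
```

So even though SIGKILL cannot be trapped by the process receiving it, a parent script like runcmsgrid.sh can still observe the 137 status of the killed child and abort.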

lviliani avatar Dec 10 '25 12:12 lviliani

is it clear, BTW, that the exit 10086 from https://gitlab.cern.ch/cms-gen/genproductions_scripts/-/blob/master/bin/MadGraph5_aMCatNLO/runcmsgrid_NLO.sh#L149-152 inside the gridpack will actually result in a non-zero exit of https://github.com/cms-sw/cmssw/blob/CMSSW_14_0_X/GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh ?

vlimant avatar Dec 10 '25 16:12 vlimant

set -e in https://github.com/cms-sw/cmssw/blob/CMSSW_14_0_X/GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh should do it, no?
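It should, for plain command failures: under set -e the wrapper exits as soon as a command it runs returns non-zero, so the exit from the gridpack script propagates. One caveat worth noting is that shell exit statuses are truncated modulo 256, so exit 10086 is actually observed as 102, which is still non-zero and therefore still fatal under set -e. A toy sketch with hypothetical stand-in scripts (inner.sh playing runcmsgrid.sh, outer.sh playing run_generic_tarball_cvmfs.sh):

```shell
#!/bin/sh
# Toy stand-ins; names are illustrative only.

cat > inner.sh <<'EOF'
#!/bin/sh
echo "systematics check failed"
exit 10086     # truncated modulo 256 -> observed as 102
EOF
chmod +x inner.sh

cat > outer.sh <<'EOF'
#!/bin/sh
set -e         # abort on the first failing command
./inner.sh     # non-zero status here terminates outer.sh too
echo "never reached"
EOF
chmod +x outer.sh

./outer.sh
echo "outer exit status: $?"   # prints 102, not 10086
```

The "never reached" line is never printed: set -e terminates outer.sh with inner.sh's (truncated) status.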

lviliani avatar Dec 10 '25 17:12 lviliani

But I think the problem is that these jobs are getting killed with SIGKILL in production during the systematics step, maybe because of excessive memory consumption when running a large chunk of events in the same job (but this is just an assumption).

This SIGKILL is killing systematics.py, but we are not able to catch it in the runcmsgrid.sh script, which therefore continues and produces LHE events without the systematic weights.

Discussing with @DickyChant we identified 2 possible solutions:

  • implement a way to catch the SIGKILL in runcmsgrid.sh, which seems not as straightforward as I was assuming;
  • check at the end of runcmsgrid.sh whether the LHE events have the weights stored. If not, it means the systematics module failed, and we abort the job.

The 2nd one is probably the best option.
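A minimal version of that end-of-job check might look like the following, appended to runcmsgrid.sh. This is only a sketch: it assumes the standard LHE convention that per-event systematic weights are written as <wgt id=...> entries, and the file name is a placeholder:

```shell
#!/bin/sh
# Sketch of option 2: verify the produced LHE file actually carries
# systematic weights before declaring the job successful.
# File name is an illustrative assumption.

lhefile="cmsgrid_final.lhe"

nevents=$(grep -c '<event' "$lhefile")
nwgt=$(grep -c '<wgt id=' "$lhefile")

if [ "$nevents" -eq 0 ] || [ "$nwgt" -eq 0 ]; then
    echo "ERROR: $nevents events but $nwgt <wgt> entries in $lhefile;" >&2
    echo "       systematics computation likely failed, aborting" >&2
    exit 1
fi
echo "weight check OK: $nevents events, $nwgt weight entries"
```

With a non-zero exit here and set -e in the outer wrapper, a weight-less LHE file would fail the GEN step immediately instead of surfacing only in NANO.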

lviliani avatar Dec 10 '25 17:12 lviliani

Addressing this in https://gitlab.cern.ch/cms-gen/genproductions_scripts/-/issues/9 (since the genproductions utilities have been migrated).

DickyChant avatar Dec 11 '25 01:12 DickyChant

NB: most failures of that sort happened at T2_US_Purdue, with much smaller fractions at T3_US_SDSC, T1_US_FNAL, and T2_US_Caltech.

vlimant avatar Dec 11 '25 10:12 vlimant

we are wasting an enormous amount of resources on those requests.

vlimant avatar Dec 12 '25 09:12 vlimant

https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00334__v1_T_251208_220920_7409 has joined the dance.

vlimant avatar Dec 12 '25 09:12 vlimant

https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00334__v1_T_251208_220920_7409 has joined the dance.

Where can I find the logs for this one?

Is this also running in one of the sites you mentioned before?

lviliani avatar Dec 12 '25 09:12 lviliani

I wondered the same, whether it is just one site or a few sites that have a CVMFS issue.

And actually I am interested in how we collect the logs. For https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_GEN-RunIII2024Summer24wmLHEGS-00510__v1_T_251110_220614_2021 I found that only 1 out of 6 logs survives until NANO.

DickyChant avatar Dec 12 '25 09:12 DickyChant

the sample of logs in Unified is biased and not representative. Looking into wmstats shows the frequency and number of failed jobs.

vlimant avatar Dec 12 '25 09:12 vlimant