cmssw icon indicating copy to clipboard operation
cmssw copied to clipboard

ExternalGeneratorFilter timing out

Open vlimant opened this issue 8 months ago • 16 comments

they were complications during the validation of similar requests and https://cms-pdmv-prod.web.cern.ch/mcm/requests?prepid=BPH-Run3Summer22EEGS-00172 is failing massively on:

Fatal Exception (Exit code: 8001)
An exception of category 'ExternalFailed' occurred while
[0] Processing Event run: 1 lumi: 16885 event: 3889891464 stream: 3
[1] Running path 'simulation_step'
[2] Calling method for module ExternalGeneratorFilter/'generator'
Exception Message:
failed waiting for external process 997_3. Timed out after 300 seconds.

https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_BPH-Run3Summer22EEGS-00172__v1_T_250416_160139_6446#BPH-Run3Summer22EEGS-00172_0:8001

log files, including the configuration can be found under https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_BPH-Run3Summer22EEGS-00172__v1_T_250416_160139_6446/8001/BPH-Run3Summer22EEGS-00172_0/

it might go away with resubmissions but it's not efficient for anyone. This is a 12.4.24 request, but the issue is also present in more recent release (14.0 at least)

vlimant avatar Apr 25 '25 08:04 vlimant

cms-bot internal usage

cmsbuild avatar Apr 25 '25 08:04 cmsbuild

A new Issue was created by @vlimant.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Apr 25 '25 08:04 cmsbuild

assign generators

vlimant avatar Apr 25 '25 08:04 vlimant

New categories assigned: generators

@bbilin,@lviliani,@menglu21,@mkirsano you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Apr 25 '25 08:04 cmsbuild

One of the logs contains the following just before the failure

%MSGEvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table
EvtGen:Could not decay:pi0 with mass:0 will throw event away! 
EvtGen:Your event has been rejected 10000 times!
EvtGen:Will now abort.
-w EvtGenInterface::decay error: incorrect status!: 

%MSG
901_5 process: SIGNAL CAUGHT: unlock
%MSG-w EvtGenInterface::decay error: empty end vertex!: 
Worker: SIGNAL CAUGHT 6

Where process 901_5 is the one that was unresponsive for 300s.

NOTE: signal 6 is an abort so it looks like the EvtGen code called abort().

Dr15Jones avatar Apr 25 '25 14:04 Dr15Jones

A Quick Look at evtgen code and one easily finds

https://github.com/cms-externals/evtgen/blob/55e2778c3ced7e21ea580978536b4d712fa395de/src/EvtGen.cpp#L195-L199

Dr15Jones avatar Apr 25 '25 19:04 Dr15Jones

Hi, let me add 5 cents as BPH MC contact. while this error with EvtGen is expected to be related with generator (fragment, etc), its origin is highly unclear for me. If smth very wrong with generator we would expect failure with 100% cases. However local tests are good and we have a lot of successful jobs, so the failure happens only partially, and I do not understand how to control it. Digging a bit deeper to the logs:

%MSG
%MSG-w EvtGenInterface::decay error: empty end vertex!:
%MSG
Begin processing the 2034068th record. Run 1, Event 4273656605, LumiSection 72178 on stream 6 at 18-Apr-2025 18:36:43.976 UTC
EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table
EvtGen:Could not decay:pi0 with mass:0 will throw event away!
EvtGen:Your event has been rejected 10000 times!
EvtGen:Will now abort.
991_5 process: SIGNAL CAUGHT: unlock
Worker: SIGNAL CAUGHT 6
Begin processing the 2034069th record. Run 1, Event 4273656606, LumiSection 72178 on stream 0 at 18-Apr-2025 18:36:43.982 UTC
----- Begin Fatal Exception 18-Apr-2025 18:36:43 UTC------------------—---

An exception of category 'ExternalFailed' occurred while
   [0] Processing  Event run: 1 lumi: 72178 event: 4272530394 stream: 5
   [1] Running path 'simulation_step'
   [2] Calling method for module ExternalGeneratorFilter/'generator'
Exception Message:
failed waiting for external process 991_5. Timed out after 300 seconds.
----- End Fatal Exception --------------------------------------------——-
%MSG-w EvtGenInterface::decay error: incorrect status!:

%MSG
%MSG-w EvtGenInterface::decay error: empty end vertex!:

%MSG
%MSG-w EvtGenInterface::decay error: incorrect status!:

%MSG
%MSG-w EvtGenInterface::decay error: empty end vertex!:

We see additionally warnings related with this check: https://github.com/cms-sw/cmssw/blob/e24f89532f74cdf48be07f887860188dab442608/GeneratorInterface/EvtGenInterface/plugins/EvtGen/EvtGenInterface.cc#L528-L534 I may suggest it happens since we force to decay K0L which is long-lived and normally decays by Geant etc. I would also naively suggest, that while I do not deeply understand these warnings, maybe they are not a show-stopper for the successful production. Still it seems that in some cases the events were produced in the end:

------------------------------------
GenXsecAnalyzer:
------------------------------------
Before Filter: total cross section = 5.596e+10 +- 0.000e+00 pb
Filter efficiency (taking into account weights)= (10) / (2.03407e+06) = 4.916e-06 +- 1.555e-06
Filter efficiency (event-level)= (10) / (2.03407e+06) = 4.916e-06 +- 1.555e-06    [TO BE USED IN MCM]

The efficiency is very low, however it is expected from physics of the filters applied.

The error with EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0 is something really odd. There is no pi0 related with the generated decay, no way our fragment can cause pi0 mass change. I have found some old LHC@ home fora with some mentions of this error: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=260&sort_style=&start=0 https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=306&sort_style=&start=60

From these discussions I may suggest that the problem can be related not with the physics / generator itself but with some computing multithreaded environment, also maybe some memory issues? This would explain why we have successful jobs in many cases, if this is configuration-depended.

Hope this may help a bit to understand.

Kiarendil avatar Apr 26 '25 15:04 Kiarendil

computing multithreaded environment

In this configuration, the code associated with EvtGen is run in its own single threaded application. This was done precisely to avoid EvtGen's threading issues.

Dr15Jones avatar Apr 26 '25 15:04 Dr15Jones

@Kiarendil While not a showstopper, we need to figure out why we need to throw away 1M CPUh into running this workflow and generator configuration. Throwing failures at computing operation is not the way I see things, when we can avoid them. Then they can concentrate on the things that matters. Last but not least, I would not trust 100% a sample that is generated with a configuration that gives pions no mass ; it might bias the analysis in very pernicious ways.

vlimant avatar Apr 26 '25 17:04 vlimant

Last but not least, I would not trust 100% a sample that is generated with a configuration that gives pions no mass ; it might bias the analysis in very pernicious ways.

AFAIK, there is no configuration which gives pion no mass, nowhere in Pythia or EvtGen You may see Pythia output in the start of the log — pi0 is with m = 0.13498 GeV:

 111  pi0                                 1   0   0    0.13498    0.00000    0.13498    0.13498  2.55313e-05   0   0   0   1   0
              0     1   0.9879900    0       22       22
              1     1   0.0119800   11       22       11      -11
              2     1   0.0000300   13       11      -11       11      -11

The fragment uses EvtGen particle data file which again has m = 013498 GeV:

add  p Particle  pi0                              111  1.3497700e-01  7.7300000e-09  0.0000000e+00     0     0  0.0000000e+00        111

While we do not know why somehow EvtGen starts to state its mass is zero and abort the process, I'm not sure we can trust it and it is not some strange artefact related with memory error etc.

While not a showstopper, we need to figure out why we need to throw away 1M CPUh into running this workflow and generator configuration.

I'm not sure I fully understand here. I agree it would be great to understand these problems, my only point was to highlight that it seems these warning does not stop cmsRun from producing the output with expected low efficiency, so this can be a clue for better understanding the problem and its origin. There are errors which results with 0 output, and this is not the one. So maybe the warnings with %MSG-w EvtGenInterface::decay error: empty end vertex!: are not the reason of the failure in general, but rather expected for the long-lived decays?...

I'd like also to highlight the efficiency is still expected to be low, so in general we had to waste a lot of Computing resources for such rare decays of long-lived particles even if we solve these failures, since we forcing the decays within tracker for the particles which normally fly to HCAL. You may see in JIRA CMSBPHMC-91: if we use decays with softer filters, it results in too little amount of reconstructed event.

As far as I understand, there are several similar samples with rare K0 decays were already produced and the analyzers team is using them successfully.

FYI @drkovalskyi , maybe you can comment on such kind of requests as member of the analysis team and BPH & Comps L2.

Kiarendil avatar Apr 26 '25 18:04 Kiarendil

Important thing also is that these failures are hard to reproduce locally ... So this is why I think (maybe naively) why Computing configuration maybe more crucial than the physics of the fragment / generator.

Kiarendil avatar Apr 26 '25 18:04 Kiarendil

Hi, I want to confirm that analysers used previously produced similar samples with K0 —> pi pi (mu mu) decays, and they are OK from their point of view it terms of physics etc, despite their low-efficiency and some warnings like %MSG-w EvtGenInterface::decay error: empty end vertex!: (which we do not understand completely). So I would suggest we rather trust its physics content, and I would believe that all Pythia+EvtGen settings are OK (basing on me experience with many other EvtGen fragment and after many private checks during last weeks it is really hard to find some error here). I believe the problem is rather Computing Operations-related like.

I can suggest the following ideas:

  • Ignore warnings %MSG-w EvtGenInterface::decay error: empty end vertex!: (likely due to force of long-lived decays)
  • I may try a local CRAB run with GS step only, e.g. 100 jobs with 8 threads, 3M events each. Maybe we would be able to get some info from jobs behaviour (failed / finished) and by comparison of different logs.

Kiarendil avatar Apr 28 '25 16:04 Kiarendil

the amount of job failures https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_BPH-Run3Summer22EEGS-00172__v1_T_250416_160139_6446 is horrendous

vlimant avatar Jun 11 '25 11:06 vlimant

I took another look at one of the logs for the failing job. Turns out message

EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0

appeared 96001 times in the log. So 96000 times the program was able to eventually get a kinematically allowed decay to happen so it proceeded. The job processed 2M events. So lets assume that each event that didn't report a problem only had EvtGen try ones, therefor EvtGen would have tried 210^6 + 10^79.6*10^4 ~= 10^12 events before it hit one it ultimately decided it couldn't process.

So yes, the failure is extremely rare but the shear astronomical number of tries being requested by this workflow makes the failure nearly inevitable.

Dr15Jones avatar Jun 11 '25 13:06 Dr15Jones

Hi, I've launched a CRAB task 250613_145945:kivanov_crab_EvtGen_testKLToMuMu for the private GS production of this fragment.

There was 100 jobs with 3M events each. We see (Monit link) 96 produced successfully with filter eff ~5e-06. 4 jobs (ids 5, 19, 75, 94) failed with the known 8001 error, as can be seen from their logs (e.g. job_out.94.0.txt):

== CMSSW: EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
== CMSSW: EvtGen:Will take first kinematically allowed decay in the decay table
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Your event has been rejected 10000 times!
== CMSSW: EvtGen:Will now abort.
== CMSSW: 538_3 process: SIGNAL CAUGHT: unlock
== CMSSW: Worker: SIGNAL CAUGHT 6
== CMSSW: ----- Begin Fatal Exception 13-Jun-2025 17:23:10 UTC-----------------------
== CMSSW: An exception of category 'ExternalFailed' occurred while
== CMSSW:    [0] Processing  Event run: 1 lumi: 56044 event: 280217816 stream: 3
== CMSSW:    [1] Running path 'simulation_step'
== CMSSW:    [2] Calling method for module ExternalGeneratorFilter/'generator'
== CMSSW: Exception Message:
== CMSSW: failed waiting for external process 538_3. Timed out after 300 seconds.
== CMSSW: ----- End Fatal Exception -------------------------------------------------

It looks like some events have been produced still:

== CMSSW: ------------------------------------
== CMSSW: GenXsecAnalyzer:
== CMSSW: ------------------------------------
== CMSSW: Before Filter: total cross section = 5.596e+10 +- 0.000e+00 pb
== CMSSW: Filter efficiency (taking into account weights)= (11) / (1.22e+06) = 9.016e-06 +- 2.719e-06
== CMSSW: Filter efficiency (event-level)= (11) / (1.22e+06) = 9.016e-06 +- 2.719e-06    [TO BE USED IN MCM]

So this job worked successfully for the first 1.2M (of 3M) events, then for some reason stuck with this mysterious pi0 error and failed. I've checked another failed jobs, all have pi0 error:

  • Job 75 failed after 175k events
  • Job 19 failed after 2.4M events
  • Job 5 failed after ~3M events (Filter efficiency (event-level)= (16) / (3e+06), I assume it was a bit less then 3M due to rounding issues)

Unfortunately I still cannot understand why and how this tedious 8001 error with pi0 mass = 0 happens, it seems semi-random. Maybe some Computing-related issue of the EvtGen (related with CPU / Memory / stack overflow / floating / etc ), as I still cannot see anything wrong with the fragment, and analyzers who work with similar samples do not see anything wrong with the generated content.

Also unfortunately at that moment I can hardly suggest any further checks or ideas how we can deal with it. @DickyChant suggested to take a look at https://github.com/cms-externals/evtgen/blob/55e2778c3ced7e21ea580978536b4d712fa395de/src/EvtGen.cpp#L196-200

if ( times == 10000 ) {
            EvtGenReport( EVTGEN_ERROR, "EvtGen" )
                << "Your event has been rejected 10000 times!" << endl;
            EvtGenReport( EVTGEN_ERROR, "EvtGen" ) << "Will now abort." << endl;
            ::abort();

We can try to ? remove this check ? locally and launch again CRAB with modified CMSSW, however I am not sure that this is a solution, and that will not break anything.

Kiarendil avatar Jun 16 '25 10:06 Kiarendil

If we were to modify EvtGen, I'd first suggest adding more context to the failure message. Printing out the whole decay chain up to and including the particle that couldn't be decayed would go a long way to diagnosing the problem. It would also be nice if instead of calling ::abort we could throw an exception and then just try another event from the top, but it is unlikely that EvtGen is exception safe.

Dr15Jones avatar Jun 16 '25 13:06 Dr15Jones

somewhat related. ;at least it’s the same error message coming from https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_HIG-Run3Summer22EEGS-00037__v1_T_250517_210739_6514

Fatal Exception (Exit code: 8001)
An exception of category 'ExternalFailed' occurred while
[0] Processing Event run: 1 lumi: 15844 event: 11534195 stream: 6
[1] Running path 'simulation_step'
[2] Calling method for module ExternalGeneratorFilter/'generator'
Exception Message:
failed waiting for external process 997_6. Timed out after 300 seconds.

although it does not look like an EvtGen, but Sherpa instead

vlimant avatar Jul 15 '25 11:07 vlimant

Though the exception message failed waiting for external process 997_6. Timed out after 300 seconds. is the best I can do inside cmsRun, it isn't particularly illuminating. To get a better understanding of what happens, you must look in the log at the output coming from the external generators. Looking at https://cms-unified.web.cern.ch/cms-unified/joblogs/cmsunified_task_HIG-Run3Summer22EEGS-00037__v1_T_250517_210739_6514/8001/HIG-Run3Summer22EEGS-00037_0/00092777-3ff5-429a-808c-7528577d94d8-102-0-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log we see

 Extracting file MIG_P+P+_13600_LHA[NNPDF30_nnlo_as_0118]_2.db
 Extracting file MPI_Cross_Sections.dat
 Extracting dir Process/
 Extracting file Process/Comix.db
 Extracting file Process/Sherpa.db
 Extracting dir Result/
 Extracting file Result.db
 Extracting file Run.dat
End of ./
906_5 process: SIGNAL CAUGHT: unlock
Worker: SIGNAL CAUGHT 4

where a signal 4 in linux happens when a CPU tries to execute an illegal (i.e. unknown to it) machine instruction.

So it looks like that Sherpa based executable was built against the wrong architecture.

Dr15Jones avatar Jul 15 '25 13:07 Dr15Jones

although it does not look like an EvtGen, but Sherpa instead

...

Though the exception message failed waiting for external process 997_6. Timed out after 300 seconds. is the best I can do inside cmsRun, it isn't particularly illuminating. To get a better understanding of what happens, you must look in the log at the output coming from the external generators.

Given that the earlier problem was about EvtGen, and this new problem looks completely unrelated, it would be better to open a new issue for that.

makortel avatar Jul 15 '25 13:07 makortel

So it looks like that Sherpa based executable was built against the wrong architecture.

AFAICT, the Sherpa gridpacks do not include any executable code, the Sherpa executables are all part of external packages. That failing job ran on a Xeon X5647 at MIT, which is a fairly old CPU that only goes up to SSE4.2. I'm unable to reproduce the problem on a modern system, so my best guess is that Sherpa, or something it links to, was built for x86-64-v3.

I agree this should go into a new issue.

dan131riley avatar Jul 15 '25 15:07 dan131riley

Hello, let me share here some recent observations we found for such low filter efficiency MC production with EvtGen: https://gitlab.cern.ch/cms-ppd/dataset-management/simulation-production/-/issues/51

In short, the problem discussed here may be related with unreasonably high value of process.source.numberEventsInLuminosityBlock which is set by McM (it divides the default value by the filter efficiency).

Kiarendil avatar Aug 29 '25 15:08 Kiarendil

"unreasonable" or not becomes a question of if not having the needed filtering hooks in generators to either appropriately bias pdfs or to drop events efficiently at an early stage should be solved by some other piece of infrastructure and/or lots of compute or if it should be solved in the generator, no? [indeed having 1 event output per 10s of hours of cpu does solve a lot of limitations as the tests referenced demonstrate]

davidlange6 avatar Aug 29 '25 16:08 davidlange6

regarding "unreasonably high value", I think you are mistaken. The value is set very reasonably so that the output dataset has more events than lumisection (preventing the lumi branches and DBS to blow up completely). the generator needs to be able to run enough numbers in one go - period. alternatively even if the FWK is able to count the events/lumi "post-generator filter" than we could lower this number, but still, the generator would have to run enough numbers to fill enough events/lumi ; for the constrain mention above.

bottom line : let's fix EvtGen

vlimant avatar Sep 01 '25 12:09 vlimant

Adding another EvtGen request (BPH-Run3Summer22GS-00178) that - among the various failures - shows the same error reported at the beginning of the discussion, namely:

Fatal Exception (Exit code: 8001)
An exception of category 'ExternalFailed' occurred while
[0] Processing Event run: 1 lumi: 193536 event: 2704978118 stream: 7
[1] Running path 'simulation_step'
[2] Calling method for module ExternalGeneratorFilter/'generator'
Exception Message:
failed waiting for external process 896_7. Timed out after 300 seconds.

This can be retrieved from both the cms-unified error summary and the cms-unified log

mlizzo avatar Sep 09 '25 09:09 mlizzo

so is there no way to bias the pt of the b-quarks coming from pythia to improve things? There are plenty of kaons in B deacys, but the muons produced by the decay file are typically going to be nearly at rest in the B reference frame

davidlange6 avatar Sep 09 '25 10:09 davidlange6

otherwise

%MSG-w EvtGenInterface::decay error: incorrect status!:

%MSG
%MSG-w EvtGenInterface::decay error: empty end vertex!:

are definitely signs of physics problems.

davidlange6 avatar Sep 09 '25 11:09 davidlange6

This is not a kaon from a b-decay, but rather a prompt kaon. Can you give a bit more details about the decay errors and their meaning?

drkovalskyi avatar Sep 09 '25 12:09 drkovalskyi

Hi @davidlange6 , these specific requests are just using SoftQCD:nonDiffractive = on" and then PythiaFilter for a kaon presence. So we do not have a b-quark , kaons can be either prompt (most likely), from B or from charm. In this case we do not apply the "usual" for the EvtGen requests PTFilter which helps us to filter Pythia events with b or c quarks only (and also to control some pt scale with PTFilter:scaleToFilter).

Kiarendil avatar Sep 09 '25 12:09 Kiarendil

I can't really - no. but if you are sending K_L and claim to decay them to 2 muons, then there should be no pi0 at all.. So thats weird - did you convince yourselves that the decay trees coming from EvtGen are as expected?

anyway since you aren't using any of the helicity tracking/ decay tree functionality in EvtGen, why not just do this directly in Pythia and avoid all the data passing between the two?

[otherwise, replace B with K in my question about biasing above:) - Filters are presumably all applied after the fact (eg, after even more data conversion/handling) rather than inside the generator]

davidlange6 avatar Sep 09 '25 12:09 davidlange6