
Sherpa related workflows get stuck due to a problem with opening an openmpi session


Dear all,

At KIT, we were seeing some problems with Sherpa related workflows at our opportunistic resources (KIT-HOREKA), e.g.

data.RequestName = cmsunified_task_SMP-RunIISummer20UL18GEN-00048__v1_T_240312_112234_8747

The jobs seem to hang with CPU usage at 0%, leading to very low efficiency (below 20%) for HoreKa resources:

https://grafana-sdm.scc.kit.edu/d/qn-VJhR4k/lrms-monitoring?orgId=1&refresh=15m&var-pool=GridKa+Opportunistic&var-schedd=total&var-location=horeka&viewPanel=98&from=1717406527904&to=1717579327904

After some investigation of the situation, we have figured out the following:

  • Logs from CMSSW are empty when connecting to the jobs themselves via HTCondor.
  • This is because CMSSW calls an external process (e.g. cmsExternalGenerator extGen777_0 777_0), which hangs and was identified to be a Sherpa process.
  • When running the configuration ourselves on a machine we control, we see the following error:
A call to mkdir was unable to create the desired directory:

  Directory: /tmp/openmpi-sessions-12009@bms1_0/52106
  Error:     No space left on device

So the entire process is unable to open an OpenMPI session. Even more problematic, the job does not fail properly but keeps hanging (i.e. running on with 0% efficiency). When running locally, we often see this message in the logs:

Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

According to our local physics group, which has some experience with running Sherpa, this is a known problem.

Resetting the $TMPDIR variable to a different location allowed us to make the process work properly when running it manually. We are not sure, though, whether this is the correct action to take on an entire (sub)site for all worker nodes...
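For illustration, a minimal standalone sketch (not CMSSW code) of why this happens and what the manual workaround does: OpenMPI creates its session directory under $TMPDIR (falling back to /tmp), so pointing TMPDIR at a job-local scratch area before the first MPI call avoids the full /tmp. The scratch path below is a hypothetical placeholder.

#include <cstdlib>
#include <iostream>
#include <mpi.h>

// OpenMPI places its session directory under TMPDIR (or /tmp if unset),
// e.g. /tmp/openmpi-sessions-<uid>@<host>_0/<pid>. If that filesystem is
// full, MPI_Init aborts as in the log quoted above.
int main(int argc, char** argv) {
  const char* tmp = std::getenv("TMPDIR");
  std::cout << "TMPDIR: " << (tmp ? tmp : "(unset, /tmp is used)") << std::endl;

  // Hypothetical workaround: redirect TMPDIR to a job-local scratch area
  // with free space *before* the first MPI call.
  setenv("TMPDIR", "/path/to/job/scratch", /*overwrite=*/1);

  if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
    std::cerr << "MPI_Init failed" << std::endl;
    return 1;
  }
  MPI_Finalize();
  return 0;
}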

We would like to know how to resolve this issue, and whether something needs to be done about the OpenMPI libraries in the CMSSW software stack.

Best regards,

Artur Gottmann

ArturAkh avatar Jun 07 '24 08:06 ArturAkh

cms-bot internal usage

cmsbuild avatar Jun 07 '24 08:06 cmsbuild

A new Issue was created by @ArturAkh.

@Dr15Jones, @antoniovilela, @makortel, @sextonkennedy, @rappoccio, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Jun 07 '24 08:06 cmsbuild

assign generator

Dr15Jones avatar Jun 07 '24 12:06 Dr15Jones

assign generators

Dr15Jones avatar Jun 07 '24 12:06 Dr15Jones

New categories assigned: generators

@alberto-sanchez, @bbilin, @GurpreetSinghChahal, @mkirsano, @menglu21, @SiewYan you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Jun 07 '24 12:06 cmsbuild

Ping @cms-sw/generators-l2

makortel avatar Jul 10 '24 16:07 makortel

@shimashimarin did we observe this also in our recent tests?

lviliani avatar Sep 06 '24 12:09 lviliani

Sorry for the late reply. I usually test the Sherpa processes locally or via private production. I have not encountered such an issue.

However, I noticed that OpenMPI is used here. The MPI parallelization is mainly meant to speed up the integration process, i.e. Sherpack generation. Parallelization of event generation can simply be done by starting multiple instances of Sherpa.

Therefore, I think it is not necessary to use OpenMPI sessions for Sherpa event generation. Maybe we can test the Sherpa event production without using OpenMPI?
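For illustration, a minimal sketch of that alternative: launching several independent generator instances (each with its own random seed) instead of using MPI within a single instance. The helper runSherpaInstance and the seed values are hypothetical placeholders; in practice each child would exec cmsRun/Sherpa with its own seed and output file.

#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Hypothetical placeholder for launching one Sherpa event-generation job.
void runSherpaInstance(int seed) {
  std::printf("would run one Sherpa event-generation instance with seed %d\n", seed);
}

int main() {
  const int nInstances = 4;  // number of independent generator instances
  for (int i = 0; i < nInstances; ++i) {
    pid_t pid = fork();
    if (pid < 0) {
      std::perror("fork");
      return 1;
    }
    if (pid == 0) {  // child: one independent instance, no MPI involved
      runSherpaInstance(1000 + i);
      _exit(0);
    }
  }
  while (wait(nullptr) > 0) {
    // parent: wait for all instances to finish
  }
  return 0;
}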

shimashimarin avatar Sep 16 '24 14:09 shimashimarin

Maybe we can test the Sherpa event production without using OpenMPI?

Just to note that avoiding OpenMPI in Sherpa in CMSSW would also avoid this thread-unsafe workaround: https://github.com/cms-sw/cmssw/blob/5587561cbbb5aa693b5982bf71e79fdb5a16ae06/GeneratorInterface/SherpaInterface/src/SherpackUtilities.cc#L154-L161 (reported in https://github.com/cms-sw/cmssw/issues/46002#issuecomment-2361848749)

makortel avatar Nov 06 '24 17:11 makortel

Is anyone looking into avoiding the use of OpenMPI from Sherpa during event production?

makortel avatar Jan 07 '25 14:01 makortel

Hi @makortel, I haven't found time to work on it yet, but it seems that we can disable OpenMPI in SherpaHadronizer.cc. I will do some tests and let you know.
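As a rough sketch of that direction (hypothetical names and build flag, not the actual SherpaHadronizer.cc), the idea would be to make the MPI setup optional so that plain event generation never touches OpenMPI:

#ifdef SHERPA_USE_MPI  // hypothetical build flag
#include <mpi.h>
#endif

// Hypothetical RAII wrapper illustrating an optional MPI setup; the real
// change would live in GeneratorInterface/SherpaInterface.
class SherpaMPIGuard {
public:
  SherpaMPIGuard(bool enableMPI, [[maybe_unused]] int argc, [[maybe_unused]] char** argv)
      : enabled_(enableMPI) {
#ifdef SHERPA_USE_MPI
    // MPI mainly speeds up the integration step (Sherpack generation);
    // event generation can run without it.
    if (enabled_) MPI_Init(&argc, &argv);
#endif
  }
  ~SherpaMPIGuard() {
#ifdef SHERPA_USE_MPI
    if (enabled_) MPI_Finalize();
#endif
  }

private:
  bool enabled_;
};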

shimashimarin avatar Jan 08 '25 16:01 shimashimarin

https://github.com/cms-sw/cmssw/pull/47994 (removing MPI C++ bindings from Sherpa, among other components) reminded me to ask again: has there been any progress in removing the use of MPI from Sherpa during event generation?

makortel avatar Apr 30 '25 18:04 makortel

Has there been any progress in removing the use of MPI from Sherpa during event generation? (https://github.com/cms-sw/cmsdist/pull/10058 reminded me)

@cms-sw/generators-l2

makortel avatar Sep 04 '25 13:09 makortel

Hi @makortel, sorry for the delay on this. I've been tied up with analysis tasks. I can start picking this up slowly if nobody else is working on it, but September will be busy with a job transfer. I'll be able to dedicate much more time to it starting in October and can work with the Sherpa authors then. Cheers, Jie (Sherpa contact)

shimashimarin avatar Sep 04 '25 15:09 shimashimarin

Another problem related to SherpaHadronizer making use of MPI came up in https://github.com/cms-sw/cmssw/issues/49332

makortel avatar Nov 06 '25 15:11 makortel