
DQM/Integration unit tests are failing in all releases but 12_6_X

Open perrotta opened this issue 3 years ago • 9 comments

DQM/Integration unit tests are failing in large numbers in all releases but 12_6_X, in all cases apparently independently of the PRs merged in the meantime.

I observed it starting in: CMSSW_12_5_X_2022-10-04-1100 CMSSW_12_4_X_2022-10-03-2300 CMSSW_12_3_X_2022-09-30-1100 CMSSW_12_2_X_2022-10-03-2300

No such issue (yet?) in the master release. In all cases no PR had been merged for the IB in which it first appeared; in particular, we have not merged anything in 12_2_X and 12_3_X for a while.

A typical log:

edmFileUtil --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=xrootd --events /store/express/Commissioning2021/ExpressCosmics/FEVT/Express-v1/000/344/518/00000/8ae6d6f6-7859-4089-84dd-4a5d89deb5df.root | tail -n +9 | head -n -5 | awk '{ print $3 }'
Error in <TNetXNGFile::Open>: [ERROR] Server responded with an error: [3011] No servers are available to read the file.

----- Begin Fatal Exception 30-Sep-2022 12:04:01 CEST-----------------------
An exception of category 'ConfigFileReadError' occurred while
   [0] Processing the python configuration file named ./src/DQM/Integration/python/clients/beam_dqm_sourceclient-live_cfg.py
Exception Message:
 unknown python problem occurred.
IndexError: list index out of range

At:
  /cvmfs/cms-ib.cern.ch/nweek-02752/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_X_2022-09-30-1100/python/DQM/Integration/config/unittestinputsource_cfi.py(107): <module>
  <frozen importlib._bootstrap>(228): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(850): exec_module
  <frozen importlib._bootstrap>(695): _load_unlocked
  <frozen importlib._bootstrap>(986): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1007): _find_and_load
  /cvmfs/cms-ib.cern.ch/nweek-02752/slc7_amd64_gcc10/cms/cmssw-patch/CMSSW_12_3_X_2022-09-30-1100/python/FWCore/ParameterSet/Config.py(722): load
  ./src/DQM/Integration/python/clients/beam_dqm_sourceclient-live_cfg.py(36): <module>

----- End Fatal Exception -------------------------------------------------
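
The `IndexError` in the log is consistent with the `edmFileUtil` pipeline printing nothing useful once the file cannot be opened, so that indexing the parsed result fails. The following is a minimal sketch of that failure mode under this assumption; it is not the code at line 107 of `unittestinputsource_cfi.py`, and the helper names `third_column` and `first_entry_or_none` are hypothetical.

```python
def third_column(text):
    """Emulate `tail -n +9 | head -n -5 | awk '{ print $3 }'` on captured text."""
    lines = text.splitlines()[8:]  # tail -n +9: skip the first 8 lines
    lines = lines[:-5]             # head -n -5: drop the last 5 lines
    column = []
    for line in lines:
        fields = line.split()
        # awk prints an empty string when a line has fewer than 3 fields
        column.append(fields[2] if len(fields) > 2 else "")
    return column


def first_entry_or_none(text):
    """Guarded access: return None instead of raising on empty output."""
    column = third_column(text)
    return column[0] if column else None


# Assumption: a failed open leaves nothing on stdout for the pipeline to parse.
failed_output = ""

try:
    third_column(failed_output)[0]  # unguarded indexing into an empty list
except IndexError as err:
    print("unguarded access fails:", err)

print("guarded access returns:", first_entry_or_none(failed_output))
```

A guard of this kind would turn the opaque "unknown python problem" into a condition the configuration could report explicitly, although it would not make the input file readable.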

perrotta avatar Oct 07 '22 08:10 perrotta

A new Issue was created by @perrotta Andrea Perrotta.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Oct 07 '22 08:10 cmsbuild

assign dqm,externals

perrotta avatar Oct 07 '22 08:10 perrotta

New categories assigned: dqm,externals

@jfernan2,@ahmad3213,@micsucmed,@iarspider,@rvenditti,@smuzaffar,@emanueleusai,@syuvivida,@aandvalenzuela,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Oct 07 '22 08:10 cmsbuild

I have reproduced the issue, also with CMSSW_12_6_X_2022-10-04-1100 (no idea why it didn't fail in the IBs). However, I don't know how to fix it; we need to wait until @smuzaffar is back.

iarspider avatar Oct 07 '22 12:10 iarspider

For the time being, I just reproduced the error in CMSSW_12_3_X_2022-09-30-1100 (after changing the input dataset in https://github.com/cms-sw/cmssw/blob/master/DQM/Integration/python/config/unittestinputsource_cfi.py#L41 to avoid the xrootd error), but we have no idea what causes it. I tried to run a couple of DQM clients outside the unit test, and they work properly.

rvenditti avatar Oct 07 '22 14:10 rvenditti

Could it be that the dataset /ExpressCosmics/Commissioning2021-Express-v1/FEVT was recently deleted and xrootd can no longer find the file? Note that we have cached these files in the ibeos area, but one needs to use protocol=ibeos to access them, e.g. the following works:

edmFileUtil  --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos --events /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

Another solution is to backport the SITECONFIG_PATH changes https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1249494617 to production releases, e.g. the 12.x/11.x release cycles.
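
The difference between the failing and the working command is only the `protocol=` query parameter in the `--catalog` argument. As an illustration, the sketch below assembles both variants of the command line; the helper names `catalog_arg` and `edm_file_util_cmd` are hypothetical, and only the `--catalog` and `--events` options quoted in this thread are used.

```python
import shlex

# Catalog file location as quoted in the commands in this thread.
CATALOG_FILE = "/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml"


def catalog_arg(protocol):
    """Build the value passed to --catalog for a given protocol name."""
    return "file:{}?protocol={}".format(CATALOG_FILE, protocol)


def edm_file_util_cmd(lfn, protocol):
    """Return the edmFileUtil command as an argument list (it is not executed)."""
    return ["edmFileUtil", "--catalog", catalog_arg(protocol), "--events", lfn]


lfn = ("/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/"
       "000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root")

for protocol in ("xrootd", "ibeos"):
    # shlex.join quotes each argument so the line can be pasted into a shell
    print(shlex.join(edm_file_util_cmd(lfn, protocol)))
```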

smuzaffar avatar Oct 10 '22 15:10 smuzaffar

@makortel , @nhduongvn , @stlammel during Core SW meeting we decided to backport https://github.com/cms-sw/cmssw/pull/37278 changes to older release cycles too. Do you see any issues doing this ? I am not sure if all sites are ready and already have new data catalogs from rucio

smuzaffar avatar Oct 13 '22 08:10 smuzaffar

during Core SW meeting we decided to backport #37278 changes to older release cycles too. Do you see any issues doing this ?

Yes, that is the plan (see https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1074259843).

Do you see any issues doing this ?

We need to be sure that the backports won't cause trouble in the old release cycles. I had earlier collected the list of fixes that need to be included in the backport in https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1249494617, and this week a new issue on the subsite treatment in the site-local-config.xml was reported in https://cms-talk.web.cern.ch/t/crab-test-cmssw-12-6-x-invalid-site-local-config/15423/17. My understanding is that @nhduongvn will open a PR for the fix soon.

I am not sure if all sites are ready and already have new data catalogs from rucio

That was actually my precondition for signing #37278 that @stlammel confirmed in https://github.com/cms-sw/cmssw/pull/37278#issuecomment-1115299198 (although with 12_6_0_pre2 reality turned out to be more complicated).

makortel avatar Oct 13 '22 13:10 makortel

So, there was a campaign earlier this year to get storage.json files in place for all sites. Two sites had held out, and the files were put in place when this was discovered several weeks ago, as Matti wrote. During the sub-site issue last week I found obsolete entries at two sites and they were corrected. The SAM test to check SITECONF is ready and will go into production with the next token update. This should detect inconsistencies before users do. (I didn't regard this as high priority, as we don't have this for the current SITECONF files either, but their being active reveals issues promptly.) I would release CMSSW_12_6, make sure everything is fine before the backport of other releases.

  • Stephan

stlammel avatar Oct 13 '22 13:10 stlammel

Hi, All,

This still needs attention, is it still the case that @nhduongvn is preparing a fix here?

rappoccio avatar Oct 21 '22 13:10 rappoccio

Hi Sal, all, The fix was provided and merged: https://github.com/cms-sw/cmssw/pull/39727

nhduongvn avatar Oct 21 '22 13:10 nhduongvn

Thanks @nhduongvn, but we still need backports to 12_5 and 12_4. @makortel is there some update there?

Otherwise, can we just move to a more recent file for the DQM checks and bypass this entirely to just use a more recent run that's still available? @cms-sw/dqm-l2 ?

rappoccio avatar Oct 21 '22 13:10 rappoccio

I would release CMSSW_12_6, make sure everything is fine before the backport of other releases.

@stlammel we won't release 12_6 until December, we can't really leave the IBs broken for 2 months.

rappoccio avatar Oct 21 '22 13:10 rappoccio

Hello Sal, @rappoccio, I am a bit confused: the old versions, including 12_4 and 12_5, should work fine without the backport. Only the 12_6 pre-releases are broken, and the next pre-release will fix this. Thanks,

  • Stephan

stlammel avatar Oct 21 '22 16:10 stlammel

Given the trouble we've had with https://github.com/cms-sw/cmssw/pull/37278 I'm not comfortable backporting it (and all the necessary fixes) to 12_4_X or 12_5_X until the data taking is over (to avoid any risk for Tier0).

That said, I think the unit tests would get fixed by just dropping the --catalog option to edmFileUtil, i.e. backporting just https://github.com/cms-sw/cmssw/pull/39266 . @smuzaffar The test machinery still sets CMS_PATH=/cvmfs/cms-ib.cern.ch, right? If that is the case, edmFileUtil will find the right storage.xml. I just tested that

CMS_PATH=/cvmfs/cms-ib.cern.ch edmFileUtil  --events /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

succeeds in CMSSW_12_5_X_2022-10-21-1100.

makortel avatar Oct 21 '22 18:10 makortel

That said, I think the unit tests would get fixed by just dropping the --catalog option to edmFileUtil, i.e. backporting just https://github.com/cms-sw/cmssw/pull/39266 .

based on my private test[^1], this won't be sufficient to fix the unit tests.

[^1]: Recipe used for the test:

cmsrel CMSSW_12_5_X_2022-10-21-1100
cd CMSSW_12_5_X_2022-10-21-1100/src/
cmsenv
git cms-addpkg DQM/Integration
git cherry-pick 9a056d437411de96fc23edd6948539c0fbe0d166
scramv1 b -j 20
cd DQM/Integration/python/clients/
voms-proxy-init -voms cms
cmsRun sistrip_dqm_sourceclient-live_cfg.py unitTest=True

mmusich avatar Oct 21 '22 19:10 mmusich

Right, dropping the --catalog option does not work for 12.5 and earlier releases. One simple fix is to either use a file known to DAS (accessible via the xrootd redirectors) or use ibeos protocol i.e. use --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos

smuzaffar avatar Oct 21 '22 22:10 smuzaffar

or use ibeos protocol i.e. use --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos

this indeed works. I have opened the following PRs:

  • https://github.com/cms-sw/cmssw/pull/39829 (12.5.X)
  • https://github.com/cms-sw/cmssw/pull/39830 (12.4.X)

Let me know if some other cycles could use an update.

mmusich avatar Oct 23 '22 13:10 mmusich

I still don't understand why just dropping the --catalog would not work. In CMSSW_12_5_X_2022-10-21-1100 I get

# this is what the test used before
$ edmFileUtil -d --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=xrootd /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://cms-xrd-global.cern.ch//store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

# with explicit ibeos
$ edmFileUtil -d --catalog file:/cvmfs/cms-ib.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=ibeos /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

# dropping --catalog, setting CMS_PATH
$ CMS_PATH=/cvmfs/cms-ib.cern.ch edmFileUtil -d /store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root
root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root

The last two cases resolve to exactly the same PFN.
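
The LFN-to-PFN mapping seen in the three outputs above can be summarised with a small lookup. This is only a simplified sketch: the prefixes are read off the printed PFNs, whereas the real catalog applies its own rules, and the names `PFN_PREFIX` and `resolve` are hypothetical.

```python
# Prefixes inferred from the edmFileUtil -d outputs quoted above (an assumption,
# not a copy of the rules in storage.xml).
PFN_PREFIX = {
    "xrootd": "root://cms-xrd-global.cern.ch/",
    "ibeos": "root://eoscms.cern.ch//eos/cms/store/user/cmsbuild",
}


def resolve(lfn, protocol):
    """Prepend the protocol-specific prefix to a logical file name."""
    return PFN_PREFIX[protocol] + lfn


lfn = ("/store/express/Run2022B/ExpressPhysics/FEVT/Express-v1/"
       "000/355/380/00000/b8a57fc4-5656-42b4-9b7b-2e647baf65e8.root")

for protocol in ("xrootd", "ibeos"):
    print(protocol, "->", resolve(lfn, protocol))
```

In this picture, dropping `--catalog` while setting `CMS_PATH=/cvmfs/cms-ib.cern.ch` behaves like the `ibeos` entry, which matches the identical PFNs reported for the last two cases.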

Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.

Anyway, given that https://github.com/cms-sw/cmssw/pull/39829 and https://github.com/cms-sw/cmssw/pull/39830 are already merged, there probably isn't practical need to continue the discussion (except maybe why the merge of #39829 did not cause this issue to close).

makortel avatar Oct 24 '22 15:10 makortel

Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.

This didn't work for me, see https://github.com/cms-sw/cmssw/issues/39669#issuecomment-1287375943

mmusich avatar Oct 24 '22 16:10 mmusich

Also running scram b use-ibeos runtests on DQM/Integration seems to work with dropping --catalog.

This didn't work for me, see #39669 (comment)

I guess because the recipe in https://github.com/cms-sw/cmssw/issues/39669#issuecomment-1287375943 did not include overriding the CMS_PATH (that I expect scram b use-ibeos runtests to do, among other things).
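
To make the role of `CMS_PATH` explicit when running a single check by hand, the override can be applied to the environment of just that one command. The sketch below shows one way to do this from Python; the function names are hypothetical, and it assumes that `edmFileUtil` reads `CMS_PATH` from its environment, as the shell commands earlier in this thread suggest.

```python
import os
import shutil
import subprocess

IB_CMS_PATH = "/cvmfs/cms-ib.cern.ch"  # value used in the commands in this thread


def env_with_cms_path(base_env=None, cms_path=IB_CMS_PATH):
    """Return a copy of the environment with CMS_PATH overridden."""
    env = dict(os.environ if base_env is None else base_env)
    env["CMS_PATH"] = cms_path
    return env


def run_edm_file_util(lfn):
    """Run `edmFileUtil -d <lfn>` with the override, or return None if unavailable."""
    if shutil.which("edmFileUtil") is None:
        return None  # not inside a CMSSW environment
    return subprocess.run(
        ["edmFileUtil", "-d", lfn],
        env=env_with_cms_path(),
        capture_output=True,
        text=True,
        check=False,
    )


print("CMS_PATH for the child process:", env_with_cms_path()["CMS_PATH"])
```

Only the child process sees the override, so the calling shell keeps whatever `CMS_PATH` it had, which mirrors the `CMS_PATH=... edmFileUtil ...` form used above.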

makortel avatar Oct 24 '22 16:10 makortel

humm, yes dropping --catalog with correct CMS_PATH also worked for me .... no idea why I had the impression that this was not working.

smuzaffar avatar Oct 24 '22 16:10 smuzaffar

no idea why I had the impression that this was not working.

That's interesting, because when I first tried to drop --catalog (i.e. backporting just https://github.com/cms-sw/cmssw/pull/39266) I also had the distinct impression that scram b use-ibeos runtests wasn't working; I then switched to single-client tests (as in the recipe of https://github.com/cms-sw/cmssw/issues/39669#issuecomment-1287375943) in order to make the tests run faster. I am wondering if something else changed in the meantime, such that scram b use-ibeos runtests now also runs OK. At any rate, I think that https://github.com/cms-sw/cmssw/pull/39829 is a superior fix because, besides letting the unit tests run, it also allows a single client to be tested in unit test mode directly, which is what developers generally use.

mmusich avatar Oct 25 '22 10:10 mmusich

Thanks a lot for the efforts here! I think we can now close the issue as the IBs are now correctly completing. Thanks everyone!

rappoccio avatar Oct 25 '22 11:10 rappoccio