
Cut parser error in ROOT master IB

Open makortel opened this issue 3 years ago • 40 comments

Workflow 1325.61 step 2 fails in CMSSW_11_3_ROOT6_X_2021-03-04-2300 with

----- Begin Fatal Exception 05-Mar-2021 08:39:28 CET-----------------------
An exception of category 'Configuration' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 6 stream: 3
   [1] Running path 'dqmoffline_step'
   [2] Calling method for module NanoAODDQM/'nanoDQMMC'
Exception Message:
Cut parser error:no method or data member named "getAnyValue" found for type "nanoaod::FlatTable::RowView" (char 0)
----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_11_3_ROOT6_X_2021-03-04-2300/pyRelValMatrixLogs/run/1325.61_TTbar_13_106Xv1NanoAODINPUT+TTbar_13_106Xv1NanoAODINPUT+NANOAODMC2017_106XMiniAODv1/step2_TTbar_13_106Xv1NanoAODINPUT+TTbar_13_106Xv1NanoAODINPUT+NANOAODMC2017_106XMiniAODv1.log#/

makortel avatar Mar 05 '21 16:03 makortel

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Mar 05 '21 16:03 cmsbuild

assign core, xpog

makortel avatar Mar 05 '21 16:03 makortel

FYI @pcanal

makortel avatar Mar 05 '21 16:03 makortel

New categories assigned: core,xpog

@Dr15Jones,@smuzaffar,@fgolf,@mariadalfonso,@makortel,@gouskos you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Mar 05 '21 16:03 cmsbuild

assign dqm

makortel avatar Mar 05 '21 16:03 makortel

New categories assigned: dqm

@jfernan2,@andrius-k,@ahmad3213,@kmaeshima,@rvenditti,@ErnestaP you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Mar 05 '21 16:03 cmsbuild

FYI @peruzzim

jfernan2 avatar Mar 08 '21 11:03 jfernan2

Another occurrence in CMSSW_11_3_DEVEL_X_2021-03-17-2300 1325.6 step 2

----- Begin Fatal Exception 18-Mar-2021 04:00:15 CET-----------------------
An exception of category 'Configuration' occurred while
   [0] Processing  Event run: 1 lumi: 101 event: 5006 stream: 1
   [1] Running path 'dqmoffline_step'
   [2] Calling method for module NanoAODDQM/'nanoDQMMC'
Exception Message:
Cut parser error:no method or data member named "getAnyValue" found for type "nanoaod::FlatTable::RowView" (char 0)
----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_11_3_DEVEL_X_2021-03-17-2300/pyRelValMatrixLogs/run/1325.6_TTbar_13_94Xv1NanoAODINPUT+TTbar_13_94Xv1NanoAODINPUT+NANOAODMC2017_94XMiniAODv1/step2_TTbar_13_94Xv1NanoAODINPUT+TTbar_13_94Xv1NanoAODINPUT+NANOAODMC2017_94XMiniAODv1.log#/

makortel avatar Mar 18 '21 12:03 makortel

@gpetruc according to the GitHub history, you were the one who introduced nanoDQMC in the code. Could you please have a look or point us to someone responsible? Thanks

jfernan2 avatar Mar 18 '21 12:03 jfernan2

Is this issue still valid? Thanks

jfernan2 avatar Oct 28 '21 08:10 jfernan2

We still see this exception intermittently; the latest I could find is from last week:

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/cc8_aarch64_gcc9/CMSSW_12_1_X_2021-10-22-1100/pyRelValMatrixLogs/run/10801.0_SingleElectronPt10+2018+SingleElectronPt10_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step6_SingleElectronPt10+2018+SingleElectronPt10_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log#/159-159

dan131riley avatar Oct 28 '21 09:10 dan131riley

Thanks @dan131riley. However, I cannot reproduce the error with that same RelVal and release by running: runTheMatrix.py -l 10801.0 -i all --ibeos (at least not within the first 10 events; the job you quoted crashed at the 4th event), so I understand it depends on the event in a random way.

On the other hand, the error seems related to these lines: https://github.com/cms-sw/cmssw/blob/master/DataFormats/NanoAOD/interface/FlatTable.h#L94-L101 But I fail to see why. I wonder if the developer @gpetruc could shed some light.

jfernan2 avatar Oct 28 '21 10:10 jfernan2

The IBs run with 4 threads, and what we see is all 4 threads failing on the first event for that thread. With all 4 threads failing, it's likely some kind of initialization failure, possibly a multi-thread race condition.

dan131riley avatar Oct 28 '21 11:10 dan131riley

Thanks @dan131riley. I have just run runTheMatrix.py -l 10801.0 -i all --ibeos -t 4 (4 threads) but got no error... :-( It looks very event dependent.

jfernan2 avatar Oct 28 '21 12:10 jfernan2

More likely timing dependent. It's an all-or-none failure: either all the streams fail or none do, which is not consistent with an event-dependent failure. Thread races can be very dependent on the system load, and the IB machines tend to be heavily loaded.
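
The all-or-none pattern is what one would expect if every stream reads shared, lazily filled reflection data before it has been populated. A minimal Python sketch of that failure mode follows. It is not CMSSW or ROOT code: the LazyTypeInfo class and its members are hypothetical, and the two events exist only to force the unlucky interleaving deterministically.

```python
import threading


class LazyTypeInfo:
    """Hypothetical stand-in for lazily loaded reflection data (not ROOT's TClass)."""

    def __init__(self, real_methods):
        self._real_methods = list(real_methods)
        self._methods = []
        self._loaded = False
        # Hooks used only to force the bad interleaving deterministically.
        self.flag_set = threading.Event()
        self.may_fill = threading.Event()

    def methods_unsafe(self):
        # Deliberate bug: the "loaded" flag is published before the list is filled.
        if not self._loaded:
            self._loaded = True
            self.flag_set.set()
            self.may_fill.wait()
            self._methods.extend(self._real_methods)
        return list(self._methods)


def simulate(n_streams=4):
    info = LazyTypeInfo(["getAnyValue"])
    seen = [None] * n_streams

    def stream(i):
        # Each "stream" asks whether the method is known, as on its first event.
        seen[i] = "getAnyValue" in info.methods_unsafe()

    initializer = threading.Thread(target=info.methods_unsafe)
    initializer.start()
    info.flag_set.wait()  # initialization has started but not finished

    workers = [threading.Thread(target=stream, args=(i,)) for i in range(n_streams)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    info.may_fill.set()  # let the initializer finish filling the list
    initializer.join()
    late = "getAnyValue" in info.methods_unsafe()
    return seen, late


if __name__ == "__main__":
    print(simulate())
```

In this toy model every stream that looks during the window sees an empty method list, while a lookup after the window succeeds. Guarding the check and the fill with one lock would close the window; whether anything like this happens inside ROOT is exactly the open question here.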

dan131riley avatar Oct 28 '21 12:10 dan131riley

Ok, I understand, but that makes it even harder to reproduce...

jfernan2 avatar Oct 28 '21 14:10 jfernan2

Probably related, in https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/slc7_amd64_gcc900/CMSSW_12_2_DEVEL_X_2021-11-04-2300/pyRelValMatrixLogs/run/10004.0_SingleGammaPt10+2017+SingleGammaPt10_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step6_SingleGammaPt10+2017+SingleGammaPt10_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log

----- Begin Fatal Exception 05-Nov-2021 07:23:16 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 6 stream: 2
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module NanoAODDQM/'nanoDQMMC'
   [3] Calling method for module SimpleGenEventFlatTableProducer/'genTable'
Exception Message:
A std::exception was thrown.
no method or data member named "hasBinningValues" found for type "GenEventInfoProduct"
----- End Fatal Exception -------------------------------------------------

dan131riley avatar Nov 05 '21 13:11 dan131riley

Thanks @dan131riley. That is stranger, since the method exists (even though it is not in a DQM class): https://github.com/cms-sw/cmssw/blob/6d2f66057131baacc2fcbdd203588c41c885b42c/SimDataFormats/GeneratorProducts/interface/GenEventInfoProduct.h#L49 So I do not understand.

jfernan2 avatar Nov 05 '21 14:11 jfernan2

+1

I am still not able to reproduce in CMSSW_12_3_ROOT624_X_2021-12-10-2300. If you think this issue is still alive, please let me know. Thanks

jfernan2 avatar Dec 13 '21 08:12 jfernan2

Occurred in CMSSW_12_3_X_2021-12-13-2300 slc7_ppc64le_gcc11

----- Begin Fatal Exception 14-Dec-2021 10:15:47 CET-----------------------
An exception of category 'Configuration' occurred while
   [0] Processing  Event run: 1 lumi: 2 event: 103 stream: 0
   [1] Running path 'dqmoffline_3_step'
   [2] Calling method for module NanoAODDQM/'nanoDQMMC'
Exception Message:
Cut parser error:no method or data member named "getAnyValue" found for type "nanoaod::FlatTable::RowView" (char 4)
----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_ppc64le_gcc11/CMSSW_12_3_X_2021-12-13-2300/pyRelValMatrixLogs/run/11834.0_TTbar_14TeV+2021PU+TTbar_14TeV_TuneCP5_GenSimINPUT+DigiPU+RecoNanoPU+HARVESTNanoPU/step3_TTbar_14TeV+2021PU+TTbar_14TeV_TuneCP5_GenSimINPUT+DigiPU+RecoNanoPU+HARVESTNanoPU.log#/423-423

makortel avatar Dec 14 '21 14:12 makortel

Thanks. I am still not able to reproduce that last example, either single-threaded or multi-threaded... and without reproducing it I cannot debug...

The only thing I know, though I don't understand it, is that the crash comes from: https://github.com/cms-sw/cmssw/blob/master/DataFormats/NanoAOD/interface/FlatTable.h#L94-L101

@gpetruc @peruzzim could you give any clue? Thanks

jfernan2 avatar Dec 14 '21 16:12 jfernan2

-1

jfernan2 avatar Dec 14 '21 16:12 jfernan2

I made https://github.com/cms-sw/cmssw/pull/36501 to add more information to the exception message for the next time it occurs.

makortel avatar Dec 15 '21 00:12 makortel

Occurred in CMSSW_12_3_X_2021-12-24-2300 slc7_ppc64le_gcc11

Rivet.Analysis.HiggsTemplateCrossSections: WARN  Unkown Higgs production mechanism. Cannot classify event. Classification for all events will most likely fail.
----- Begin Fatal Exception 25-Dec-2021 04:35:24 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 3 stream: 1
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module NanoAODDQM/'nanoDQMMC'
   [3] Calling method for module SimpleGenEventFlatTableProducer/'genTable'
Exception Message:
A std::exception was thrown.
no method or data member named "hasBinningValues" found for type "GenEventInfoProduct"
It has the following methods
and the following data members
 weights_
 signalProcessID_
 qScale_
 alphaQCD_
 alphaQED_
 pdf_
 binningValues_
 DJRValues_
 nMEPartons_
 nMEPartonsFiltered_
----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_ppc64le_gcc11/CMSSW_12_3_X_2021-12-24-2300/pyRelValMatrixLogs/run/10802.0_SingleElectronPt35+2018+SingleElectronPt35_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step6_SingleElectronPt35+2018+SingleElectronPt35_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log#/

makortel avatar Dec 28 '21 14:12 makortel

Occurred in CMSSW_12_3_X_2021-12-27-2300 cs8_ppc64le_gcc11

%MSG
----- Begin Fatal Exception 28-Dec-2021 09:17:49 CET-----------------------
An exception of category 'Configuration' occurred while
   [0] Processing  Event run: 1 lumi: 47 event: 4606 stream: 2
   [1] Running path 'dqmoffline_step'
   [2] Calling method for module NanoAODDQM/'nanoDQMMC'
Exception Message:
Cut parser error:no method or data member named "getAnyValue" found for type "nanoaod::FlatTable::RowView"
It has the following methods
and the following data members
 table_
 row_
 (char 0)
Cut string was getAnyValue("pt") > 15 && abs(getAnyValue("dxy")) < 0.2 && abs(getAnyValue("dz")) < 0.5 && getAnyValue("cutBased") >= 3 && getAnyValue("miniPFRelIso_all") < 0.4
----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/cs8_ppc64le_gcc11/CMSSW_12_3_X_2021-12-27-2300/pyRelValMatrixLogs/run/10071.0_QCD_FlatPt_15_3000HS_13+2017+QCDForPF_13TeV_TuneCUETP8M1_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step6_QCD_FlatPt_15_3000HS_13+2017+QCDForPF_13TeV_TuneCUETP8M1_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log#/
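
To make the failing selection concrete, here is a rough Python sketch of what the cut string in the log above selects. It is not the CMSSW expression parser: Row is a hypothetical stand-in for nanoaod::FlatTable::RowView, and the column values are made up for illustration. Note that the reported position (char 0) coincides with the first getAnyValue call in the cut string.

```python
class Row:
    """Hypothetical stand-in for nanoaod::FlatTable::RowView: one row of named columns."""

    def __init__(self, **columns):
        self._columns = columns

    def getAnyValue(self, name):
        # Return the column value as a float, whatever its stored type.
        return float(self._columns[name])


def passes_cut(row):
    # Same conditions as the cut string in the log, written out in Python.
    return (
        row.getAnyValue("pt") > 15
        and abs(row.getAnyValue("dxy")) < 0.2
        and abs(row.getAnyValue("dz")) < 0.5
        and row.getAnyValue("cutBased") >= 3
        and row.getAnyValue("miniPFRelIso_all") < 0.4
    )


# Made-up example values, not taken from any real event.
good = Row(pt=25.0, dxy=0.01, dz=0.1, cutBased=4, miniPFRelIso_all=0.1)
soft = Row(pt=10.0, dxy=0.01, dz=0.1, cutBased=4, miniPFRelIso_all=0.1)

if __name__ == "__main__":
    print(passes_cut(good), passes_cut(soft))
```

Every term of the cut goes through the same getAnyValue accessor, so if that one method cannot be resolved the whole expression fails to parse.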

makortel avatar Dec 28 '21 15:12 makortel

In both example failures the list of methods is empty, which is what causes the exception. The printout comes from https://github.com/cms-sw/cmssw/blob/5cc0e67f56fdf5cf3b0126da9b378733083ae17f/CommonTools/Utils/src/MethodSetter.cc#L127-L133

The loop uses these functions: https://github.com/cms-sw/cmssw/blob/5cc0e67f56fdf5cf3b0126da9b378733083ae17f/FWCore/Reflection/src/TypeWithDict.cc#L880-L889

IterWithDict is essentially a wrapper over TIter: https://github.com/cms-sw/cmssw/blob/master/FWCore/Reflection/interface/IterWithDict.h https://github.com/cms-sw/cmssw/blob/master/FWCore/Reflection/src/IterWithDict.cc

Given that TypeDataMembers is able to list the data members correctly, I'd be tempted to conclude that type.getClass() returns a non-null pointer: https://github.com/cms-sw/cmssw/blob/5cc0e67f56fdf5cf3b0126da9b378733083ae17f/FWCore/Reflection/src/TypeWithDict.cc#L858 https://github.com/cms-sw/cmssw/blob/5cc0e67f56fdf5cf3b0126da9b378733083ae17f/FWCore/Reflection/src/TypeWithDict.cc#L380-L385
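
The reasoning here can be sketched with a toy model: the lookup walks a type's members, and when the name is not found the error lists whatever the reflection layer reports. The Python below is only an analogy for that logic, not the MethodSetter or TypeWithDict code; TypeInfo and find_member are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class TypeInfo:
    """Hypothetical reflection record for one C++ type."""
    name: str
    methods: list = field(default_factory=list)
    data_members: list = field(default_factory=list)


def find_member(type_info, wanted):
    # Look among the methods first, then among the data members.
    if wanted in type_info.methods:
        return ("method", wanted)
    if wanted in type_info.data_members:
        return ("data member", wanted)
    # Nothing matched: report everything the reflection layer claims to know.
    lines = [
        f'no method or data member named "{wanted}" found for type "{type_info.name}"',
        "It has the following methods",
        *(f" {m}" for m in type_info.methods),
        "and the following data members",
        *(f" {d}" for d in type_info.data_members),
    ]
    raise LookupError("\n".join(lines))


# State matching the log: data members are known, the method list is empty.
broken = TypeInfo("nanoaod::FlatTable::RowView", methods=[], data_members=["table_", "row_"])

if __name__ == "__main__":
    try:
        find_member(broken, "getAnyValue")
    except LookupError as err:
        print(err)
```

In this model an entirely unknown type would produce two empty lists, whereas the logs show an empty method list next to a populated data-member list. That asymmetry is what points at the function-member listing rather than at a missing class.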

Could there be a race condition in TClass? @pcanal

I think (but did not verify) that in both cases the TypeWithDict is constructed from std::type_info, in which case TypeWithDict::class_ is initialized as in https://github.com/cms-sw/cmssw/blob/5cc0e67f56fdf5cf3b0126da9b378733083ae17f/FWCore/Reflection/src/TypeWithDict.cc#L277-L279

makortel avatar Dec 28 '21 15:12 makortel

Occurred in CMSSW_12_3_GEANT4_X_2021-12-28-2300 https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc10/CMSSW_12_3_GEANT4_X_2021-12-28-2300/pyRelValMatrixLogs/run/10842.0_ZMM_13+2018+ZMM_13TeV_TuneCUETP8M1_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step6_ZMM_13+2018+ZMM_13TeV_TuneCUETP8M1_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log#/

makortel avatar Dec 29 '21 14:12 makortel

@makortel have you seen any recent occurrence of this?

vlimant avatar Sep 30 '22 14:09 vlimant

> Could there be a race condition in TClass? @pcanal

That's unlikely nowadays but it is of course possible.

> in both cases the TypeWithDict is constructed from std::type_info

Then it could be a case of a missing dictionary (not generated or somehow not loaded).
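
One way to probe the missing-dictionary hypothesis interactively would be to ask ROOT what it knows about the class. The sketch below uses PyROOT and assumes an environment where ROOT, and ideally the relevant CMSSW libraries, can be loaded. The describe_class helper is a hypothetical name, and it returns None when ROOT cannot be imported.

```python
def describe_class(class_name):
    """Summarize what ROOT's reflection knows about class_name (None without ROOT)."""
    try:
        import ROOT  # PyROOT; only available in a ROOT-enabled environment
    except ImportError:
        return None

    cls = ROOT.TClass.GetClass(class_name)
    if not cls:
        # ROOT has no TClass for this name at all.
        return {"found": False, "has_dictionary": False,
                "n_methods": 0, "n_data_members": 0}

    methods = cls.GetListOfMethods()
    members = cls.GetListOfDataMembers()
    return {
        "found": True,
        "has_dictionary": bool(cls.HasDictionary()),
        "n_methods": methods.GetSize() if methods else 0,
        "n_data_members": members.GetSize() if members else 0,
    }


if __name__ == "__main__":
    print(describe_class("nanoaod::FlatTable::RowView"))
```

A result with data members but zero methods would mirror the state seen in the exception messages, although a single interactive check cannot rule out a load-order or timing problem in the multi-threaded job.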

pcanal avatar Sep 30 '22 21:09 pcanal

I don't remember seeing this exception any time recently (but it is possible that I've just forgotten). We have changed the ROOT version in between, though.

makortel avatar Sep 30 '22 21:09 makortel