
DeepBoostedJetTagInfoProducer failure in PromptReco_Run381443_ParkingSingleMuon4 (CMSSW_14_0_7 on AMD arch)

Open gpetruc opened this issue 1 year ago • 26 comments

Hello,

There's another PromptReco failure that, like #45189, seems to be reproducible on AMD but not on Intel. CMS-talk thread: https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381443-parkingsinglemuon4-deepboostedjettaginfoproducer/42164

Exception:

Begin processing the 1st record. Run 381443, Event 2226011497, LumiSection 1038 on stream 0 at 11-Jun-2024 11:17:40.704 CEST
Matched new: [Fatal Exception]
    An exception of category 'InvalidReference' occurred while
       [0] Processing  Event run: 381443 lumi: 1038 event: 2226011497 stream: 0
       [1] Running path 'dqmoffline_step'
       [2] Prefetching for module ParticleNetJetTagMonitor/'particleNetAK8HbbTagMonitoring'
       [3] Prefetching for module BTagProbabilityToDiscriminator/'pfParticleNetAK4DiscriminatorsJetTagsForRECO'
       [4] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetAK4JetTagsForRECO'
       [5] Calling method for module DeepBoostedJetTagInfoProducer/'pfParticleNetAK4TagInfosForRECO'
    Exception Message:
    BadRefCore RefCore: Request to resolve a null or invalid reference to a product of type 'std::vector<reco::Vertex>' has been detected.
    Please modify the calling code to test validity before dereferencing.

Recipe to reproduce it on an AMD EL8 machine:

export SCRAM_ARCH=el8_amd64_gcc12
cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/DeepBoostedJet/job/WMTaskSpace/cmsRun1/PSet.pkl .
cat > PSet_one.py <<END
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.source.eventsToProcess = cms.untracked.VEventRange("381443:1038:2226011497",)
process.options.wantSummary = cms.untracked.bool(True)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
END
cmsRun PSet_one.py 2>&1 | tee PSet_one.log  

gpetruc avatar Jun 11 '24 09:06 gpetruc

cms-bot internal usage

cmsbuild avatar Jun 11 '24 09:06 cmsbuild

A new Issue was created by @gpetruc.

@rappoccio, @smuzaffar, @makortel, @Dr15Jones, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Jun 11 '24 09:06 cmsbuild

assign RecoBTag/FeatureTools

Dr15Jones avatar Jun 11 '24 13:06 Dr15Jones

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Jun 11 '24 13:06 cmsbuild

type pf

jfernan2 avatar Jun 11 '24 14:06 jfernan2

type btv

jfernan2 avatar Jun 11 '24 14:06 jfernan2

I would like to reproduce this. Does anyone have a pointer for finding an AMD machine I can use interactively? Either one running EL8 or one on which I can run Singularity.

mandrenguyen avatar Jun 17 '24 17:06 mandrenguyen

Anyone have a pointer for finding an AMD machine I can use interactively?

e.g. on lxplus800:

[musich@lxplus800 ~]$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           16
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7313 16-Core Processor
Stepping:            1
CPU MHz:             3000.134
BogoMIPS:            6000.26
Virtualization:      AMD-V
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid arch_capabilities
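
For anyone scripting the same check, here is a minimal sketch of reading the CPU vendor out of lscpu output like the one above. The helper name cpu_vendor is hypothetical; the only assumption about lscpu is the "Vendor ID:" field shown in the listing.

```python
def cpu_vendor(lscpu_text):
    """Return the value of the 'Vendor ID' field from lscpu output, or None."""
    for line in lscpu_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Vendor ID":
            return value.strip()
    return None


# Excerpt of the lscpu output quoted above.
sample = """Architecture:        x86_64
Vendor ID:           AuthenticAMD
Model name:          AMD EPYC 7313 16-Core Processor
"""

vendor = cpu_vendor(sample)
print(vendor, "-> AMD" if vendor == "AuthenticAMD" else "-> not AMD")
```

On a live node the text could come from subprocess.run(["lscpu"], capture_output=True, text=True).stdout instead of the hard-coded excerpt.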

I tested the reproducer above (https://github.com/cms-sw/cmssw/issues/45190#issue-2345880810); it fails with:

----- Begin Fatal Exception 17-Jun-2024 19:45:02 CEST-----------------------
An exception of category 'InvalidReference' occurred while
   [0] Processing  Event run: 381443 lumi: 1038 event: 2226011497 stream: 0
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module ParticleNetJetTagMonitor/'particleNetAK8HbbTagMonitoring'
   [3] Prefetching for module BTagProbabilityToDiscriminator/'pfParticleNetAK4DiscriminatorsJetTagsForRECO'
   [4] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetAK4JetTagsForRECO'
   [5] Calling method for module DeepBoostedJetTagInfoProducer/'pfParticleNetAK4TagInfosForRECO'
Exception Message:
BadRefCore RefCore: Request to resolve a null or invalid reference to a product of type 'std::vector<reco::Vertex>' has been detected.
Please modify the calling code to test validity before dereferencing.
----- End Fatal Exception -------------------------------------------------

mmusich avatar Jun 17 '24 17:06 mmusich

The code is crashing at this line: https://github.com/cms-sw/cmssw/blob/48adff12e2ce1ef25736f3df39542f5f64046fa1/RecoBTag/FeatureTools/plugins/DeepBoostedJetTagInfoProducer.cc#L684

This appears to be fixed by guarding that line with: if (pv_ass.isNonnull())

No clue why this only shows up on AMD though.

mandrenguyen avatar Jun 17 '24 20:06 mandrenguyen

In vtx_ass_from_pfcand there is the statement if (pfcand.trackRef().isNonnull() && pv->trackWeight(pfcand.trackRef()) > 0.5 && pv_ass_quality == 7). If the first clause fails, the others SHALL not be evaluated.

Adding a cout before the call:

std::cout << ">>>> " << icand << ' ' << pv_ass_quality << ' ' << (reco_cand->trackRef().isNonnull() ? "okTk" : "noTk") << (pv_ass.isNonnull() ? "okPV" : "nullPV" )<< std::endl;

I get this on Intel (lxplus806):

>>>> 0 6 okTkokPV
>>>> 1 0 noTknullPV
>>>> 2 2 okTkokPV
>>>> 3 0 noTknullPV
>>>> 4 7 okTkokPV
>>>> 5 6 okTkokPV
>>>> 6 2 okTkokPV
>>>> 7 6 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 2 okTkokPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 0 noTknullPV
>>>> 15 0 noTknullPV
>>>> 16 6 okTkokPV
>>>> 0 1 okTkokPV
>>>> 1 6 okTkokPV
>>>> 2 7 okTkokPV
>>>> 3 7 okTkokPV
>>>> 4 2 okTkokPV
>>>> 5 7 okTkokPV
>>>> 6 0 noTknullPV
>>>> 7 2 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 0 noTknullPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 2 okTkokPV
>>>> 15 6 okTkokPV

Each time there is a Tk there is a PV as well (and vice versa).

VinInn avatar Jun 19 '24 13:06 VinInn

On AMD (lxplus800):

>>>> 0 6 okTkokPV
>>>> 1 0 noTknullPV
>>>> 2 2 okTkokPV
>>>> 3 0 noTknullPV
>>>> 4 7 okTkokPV
>>>> 5 6 okTkokPV
>>>> 6 2 okTkokPV
>>>> 7 6 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 2 okTkokPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 0 noTknullPV
>>>> 15 0 noTknullPV
>>>> 16 6 okTkokPV
>>>> 0 1 okTkokPV
>>>> 1 6 okTkokPV
>>>> 2 7 okTkokPV
>>>> 3 7 okTkokPV
>>>> 4 2 okTkokPV
>>>> 5 7 okTkokPV
>>>> 6 0 noTknullPV
>>>> 7 2 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 0 noTknullPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 2 okTkokPV
>>>> 15 6 okTkokPV
>>>> 16 0 okTknullPV
----- Begin Fatal Exception 19-Jun-2024 15:13:26 CEST-----------------------
An exception of category 'InvalidReference' occurred while
   [0] Processing  Event run: 381443 lumi: 1038 event: 2226011497 stream: 0
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module ParticleNetJetTagMonitor/'particleNetAK8HbbTagMonitoring'
   [3] Prefetching for module BTagProbabilityToDiscriminator/'pfParticleNetAK4DiscriminatorsJetTagsForRECO'
   [4] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetAK4JetTagsForRECO'
   [5] Calling method for module DeepBoostedJetTagInfoProducer/'pfParticleNetAK4TagInfosForRECO'
Exception Message:
BadRefCore RefCore: Request to resolve a null or invalid reference to a product of type 'std::vector<reco::Vertex>' has been detected.
Please modify the calling code to test validity before dereferencing.

So WHO is this 16th (actually 17th) candidate?
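
A small sketch for picking such lines out of a longer log automatically, assuming the ">>>> icand quality flags" layout printed by the cout above. The function name tracks_without_pv is hypothetical, and the block index relies on the assumption that the candidate counter restarts at 0 for each jet.

```python
def tracks_without_pv(log_lines):
    """Return (block_index, candidate_index, quality) for 'okTk...nullPV' lines."""
    hits = []
    block = -1
    for line in log_lines:
        parts = line.split()
        if len(parts) != 4 or parts[0] != ">>>>":
            continue  # not a debug line (e.g. exception text)
        icand, quality, flags = int(parts[1]), int(parts[2]), parts[3]
        if icand == 0:
            block += 1  # assumption: the counter restarts for each jet
        # A valid track combined with a null PV is the dangerous case.
        if flags.startswith("okTk") and flags.endswith("nullPV"):
            hits.append((block, icand, quality))
    return hits


# A few lines taken from the AMD output above.
amd_excerpt = [
    ">>>> 0 1 okTkokPV",
    ">>>> 6 0 noTknullPV",
    ">>>> 15 6 okTkokPV",
    ">>>> 16 0 okTknullPV",
    "----- Begin Fatal Exception 19-Jun-2024 15:13:26 CEST-----------------------",
]

print(tracks_without_pv(amd_excerpt))
```

The block index counts the jets seen within the supplied lines only, so it is relative to wherever the excerpt starts.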

VinInn avatar Jun 19 '24 13:06 VinInn

I printed the size of the vector and indeed it is 16 on Intel and 17 on AMD... very fishy. It needs full debugging, as this should not be possible (I suspect a memory issue); valgrind may help.

VinInn avatar Jun 19 '24 13:06 VinInn

The input jet seems different

VinInn avatar Jun 19 '24 14:06 VinInn

In the event there are 123 jets (sic). Jet 2 has 16 constituents on Intel and 17 on AMD; all others have the same number. A spurious pfCand, or a difference in the jet algo?

VinInn avatar Jun 20 '24 11:06 VinInn

Anyhow, this is the protection I suggest adding:

diff --git a/RecoBTag/FeatureTools/src/deep_helpers.cc b/RecoBTag/FeatureTools/src/deep_helpers.cc
index 76b443542b3..faf1649d9b8 100644
--- a/RecoBTag/FeatureTools/src/deep_helpers.cc
+++ b/RecoBTag/FeatureTools/src/deep_helpers.cc
@@ -150,7 +150,7 @@ namespace btagbtvdeep {

   float vtx_ass_from_pfcand(const reco::PFCandidate &pfcand, int pv_ass_quality, const reco::VertexRef &pv) {
     float vtx_ass = pat::PackedCandidate::PVAssociationQuality(qualityMap[pv_ass_quality]);
-    if (pfcand.trackRef().isNonnull() && pv->trackWeight(pfcand.trackRef()) > 0.5 && pv_ass_quality == 7) {
+    if (pv.isNonnull() && pfcand.trackRef().isNonnull() && pv->trackWeight(pfcand.trackRef()) > 0.5 && pv_ass_quality == 7) {
       vtx_ass = pat::PackedCandidate::UsedInFitTight;
     }
     return vtx_ass;
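
The ordering of the checks is what matters here. Below is a Python stand-in for the same logic, only as a sketch: FakeVertex, track_weight and used_in_fit_tight are hypothetical names, not CMSSW API, and None plays the role of a null reference. Python's "and" short-circuits left to right like C++ "&&", so the vertex is only touched once it is known to be non-null.

```python
class FakeVertex:
    """Hypothetical stand-in for a vertex holding per-track fit weights."""

    def __init__(self, weights):
        self._weights = weights

    def track_weight(self, track):
        return self._weights.get(track, 0.0)


def used_in_fit_tight(pv, track, pv_ass_quality):
    # The None checks must come before pv.track_weight(...):
    # evaluation stops at the first False clause.
    return (
        pv is not None
        and track is not None
        and pv.track_weight(track) > 0.5
        and pv_ass_quality == 7
    )


good_pv = FakeVertex({"trk": 0.9})
print(used_in_fit_tight(good_pv, "trk", 7))  # track used in a tight fit
print(used_in_fit_tight(None, "trk", 0))     # null vertex: no dereference
```

Without the first clause, the second call would fail with an AttributeError on None, which is the Python analogue of the BadRefCore exception in this issue.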

Of course, there is plenty of room for optimization pretty much everywhere.

VinInn avatar Jun 20 '24 11:06 VinInn

The input file is no longer there:

Failed to open the file 'root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024E/ParkingSingleMuon4/RAW/v1/000/381/443/00000/95d48fcb-0633-415c-a5cf-f2caeebab628.root?eos.app=cmst0'

VinInn avatar Jun 26 '24 09:06 VinInn

Is there a way to recover the input file? I would really like to better understand the origin of the difference between AMD and Intel.

VinInn avatar Jun 27 '24 12:06 VinInn

Is there a way to recover the input file? I would really like to better understand the origin of the difference between AMD and Intel.

@germanfgv @LinaresToine please comment.

mmusich avatar Jun 27 '24 15:06 mmusich

Should now be available at

/eos/cms/store/data/Run2024E/ParkingSingleMuon4/RAW/v1/000/381/443/00000/95d48fcb-0633-415c-a5cf-f2caeebab628.root

missirol avatar Jun 28 '24 13:06 missirol

On AMD, the generalTracks collection has one more track than in the Intel case, and the track has the following properties:

pt=0.0130999 eta=-3.36499 phi=-0.951959 ptError=0.0195098 dzError=-nan

missirol avatar Jun 30 '24 08:06 missirol

In principle we have removed all "raw" Ofast flags that could produce a difference. Maybe it is TensorFlow. I would tag this issue tracking-pog @slava77

VinInn avatar Jun 30 '24 08:06 VinInn

In principle we have removed all "raw" Ofast flags that could produce a difference.

As I recall the evidence was that there are fewer differences between AMD and Intel; there was no evidence that the results become identical.

slava77 avatar Jul 01 '24 12:07 slava77

pt=0.0130999 eta=-3.36499 phi=-0.951959 ptError=0.0195098 dzError=-nan

Is covariance(i_dsz, i_dsz) also nan or is it negative?

slava77 avatar Jul 01 '24 13:07 slava77

pt=0.0130999 eta=-3.36499 phi=-0.951959 ptError=0.0195098 dzError=-nan

Is covariance(i_dsz, i_dsz) also nan or is it negative?

It is negative. Patch in [*] and output below.

XXX pt=0.0130999 eta=-3.36499 phi=-0.951959 dzError=-nan vtxIdMinSignif=-1 covariance(4, 4)=-0.281146

[*]

diff --git a/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc b/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc
index fad6b30333b..05042d01cca 100644
--- a/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc
+++ b/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc
@@ -5,6 +5,7 @@
 #include "DataFormats/Math/interface/deltaR.h"
 #include "TrackingTools/IPTools/interface/IPTools.h"
 #include "FWCore/Utilities/interface/isFinite.h"
+#include "FWCore/MessageLogger/interface/MessageLogger.h"
 
 std::pair<int, PrimaryVertexAssignment::Quality> PrimaryVertexAssignment::chargedHadronVertex(
     const reco::VertexCollection& vertices,
@@ -184,6 +185,10 @@ std::pair<int, PrimaryVertexAssignment::Quality> PrimaryVertexAssignment::charge
   // all other tracks could be non-B secondaries and we just attach them with closest Z
   if (vtxIdMinSignif >= 0)
     return {vtxIdMinSignif, PrimaryVertexAssignment::OtherDz};
+
+edm::LogPrint("AAAA") << "XXX pt=" << track->pt() << " eta=" << track->eta() << " phi=" << track->phi() << " dzError=" << track->dzError() << " vtxIdMinSignif=" << vtxIdMinSignif
+<< " covariance(4, 4)=" << track->covariance(4, 4);
+
   //If for some reason even the dz failed (when?) we consider the track not assigned
   return {-1, PrimaryVertexAssignment::Unassigned};
 }
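
To illustrate why a negative diagonal covariance term goes together with a NaN error, here is a minimal sketch. It assumes the reported error is obtained through a square root of that covariance term, which is what the question above suggests; the helper names are hypothetical. Note that Python's math.sqrt raises on negative input, whereas C++ std::sqrt returns NaN, so the sketch emulates the C++ behaviour explicitly.

```python
import math


def error_from_variance(variance):
    """Square root of a variance; NaN if it is negative (emulates C++ std::sqrt)."""
    if math.isnan(variance) or variance < 0:
        return math.nan
    return math.sqrt(variance)


def has_finite_error(variance):
    """True when the variance yields a usable (finite) error."""
    return math.isfinite(error_from_variance(variance))


cov44 = -0.281146  # covariance(4, 4) printed in the output above
print(error_from_variance(cov44), has_finite_error(cov44))
```

A finiteness check of this kind on the track errors is one possible way to flag such a track before it reaches downstream consumers; whether and where to apply it is a separate decision.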

missirol avatar Jul 01 '24 14:07 missirol

Why only on AMD? (Or better: why is the track not there at all on Intel?)

VinInn avatar Jul 01 '24 14:07 VinInn

type tracking

slava77 avatar Jul 01 '24 17:07 slava77

It seems we now also have a different failure that only occurs on AMD: https://github.com/cms-sw/cmssw/issues/45398. Just cross-posting it here.

mandrenguyen avatar Jul 09 '24 14:07 mandrenguyen

@gpetruc This issue seems to be fixed with this PR. Please close it.

soureek avatar Jul 29 '25 06:07 soureek

Alternatively @cms-sw/reconstruction-l2 could sign the issue

makortel avatar Jul 29 '25 14:07 makortel

+q

jfernan2 avatar Oct 31 '25 10:10 jfernan2