DeepBoostedJetTagInfoProducer failure in PromptReco_Run381443_ParkingSingleMuon4 (CMSSW_14_0_7 on AMD arch)
Hello,
There's another PromptReco failure that, like #45189, seems to be reproducible on AMD but not on Intel. CMS-talk thread: https://cms-talk.web.cern.ch/t/paused-job-for-promptreco-run381443-parkingsinglemuon4-deepboostedjettaginfoproducer/42164
Exception:
Begin processing the 1st record. Run 381443, Event 2226011497, LumiSection 1038 on stream 0 at 11-Jun-2024 11:17:40.704 CEST
Matched new: [Fatal Exception]
An exception of category 'InvalidReference' occurred while
[0] Processing Event run: 381443 lumi: 1038 event: 2226011497 stream: 0
[1] Running path 'dqmoffline_step'
[2] Prefetching for module ParticleNetJetTagMonitor/'particleNetAK8HbbTagMonitoring'
[3] Prefetching for module BTagProbabilityToDiscriminator/'pfParticleNetAK4DiscriminatorsJetTagsForRECO'
[4] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetAK4JetTagsForRECO'
[5] Calling method for module DeepBoostedJetTagInfoProducer/'pfParticleNetAK4TagInfosForRECO'
Exception Message:
BadRefCore RefCore: Request to resolve a null or invalid reference to a product of type 'std::vector<reco::Vertex>' has been detected.
Please modify the calling code to test validity before dereferencing.
Recipe to reproduce it on an AMD EL8 machine:
export SCRAM_ARCH=el8_amd64_gcc12
cmsrel CMSSW_14_0_7
cd CMSSW_14_0_7/src
cmsenv
cp /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/DeepBoostedJet/job/WMTaskSpace/cmsRun1/PSet.pkl .
cat > PSet_one.py <<END
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
process.source.eventsToProcess = cms.untracked.VEventRange("381443:1038:2226011497",)
process.options.wantSummary = cms.untracked.bool(True)
process.options.numberOfThreads = 1
process.options.numberOfStreams = 1
END
cmsRun PSet_one.py 2>&1 | tee PSet_one.log
A new Issue was created by @gpetruc.
@rappoccio, @smuzaffar, @makortel, @Dr15Jones, @antoniovilela, @sextonkennedy can you please review it and eventually sign/assign? Thanks.
assign RecoBTag/FeatureTools
New categories assigned: reconstruction
@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
type pf
type btv
I would like to reproduce this. Anyone have a pointer for finding an AMD machine I can use interactively? Either one running EL8 or one on which I can run Singularity.
> Anyone have a pointer for finding an AMD machine I can use interactively?
e.g. on lxplus800:
[musich@lxplus800 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 16
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7313 16-Core Processor
Stepping: 1
CPU MHz: 3000.134
BogoMIPS: 6000.26
Virtualization: AMD-V
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid arch_capabilities
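Before running the reproducer it can help to confirm which vendor a node reports, by reading the `Vendor ID` field of the `lscpu` output. The sketch below is only illustrative: the helper names `cpu_vendor` and `is_amd` are hypothetical, and it assumes the `Field: value` layout shown above.

```python
def cpu_vendor(lscpu_text):
    """Return the value of the 'Vendor ID' field from lscpu output, or None."""
    for line in lscpu_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Vendor ID":
            return value.strip()
    return None


def is_amd(lscpu_text):
    """True if the node identifies itself as an AMD CPU."""
    return cpu_vendor(lscpu_text) == "AuthenticAMD"


# In practice the text would come from running `lscpu` on the node;
# here a short excerpt of the output above is used instead.
sample = """Architecture: x86_64
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7313 16-Core Processor"""
print(cpu_vendor(sample), is_amd(sample))
```

Parsing by field name rather than by line position keeps the check independent of how many lines a given `lscpu` version prints.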
I tested the reproducer above (https://github.com/cms-sw/cmssw/issues/45190#issue-2345880810); it fails with:
----- Begin Fatal Exception 17-Jun-2024 19:45:02 CEST-----------------------
An exception of category 'InvalidReference' occurred while
[0] Processing Event run: 381443 lumi: 1038 event: 2226011497 stream: 0
[1] Running path 'dqmoffline_step'
[2] Prefetching for module ParticleNetJetTagMonitor/'particleNetAK8HbbTagMonitoring'
[3] Prefetching for module BTagProbabilityToDiscriminator/'pfParticleNetAK4DiscriminatorsJetTagsForRECO'
[4] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetAK4JetTagsForRECO'
[5] Calling method for module DeepBoostedJetTagInfoProducer/'pfParticleNetAK4TagInfosForRECO'
Exception Message:
BadRefCore RefCore: Request to resolve a null or invalid reference to a product of type 'std::vector<reco::Vertex>' has been detected.
Please modify the calling code to test validity before dereferencing.
----- End Fatal Exception -------------------------------------------------
The code is crashing at this line: https://github.com/cms-sw/cmssw/blob/48adff12e2ce1ef25736f3df39542f5f64046fa1/RecoBTag/FeatureTools/plugins/DeepBoostedJetTagInfoProducer.cc#L684
This appears to be fixed by conditioning that line with:
if(pv_ass.isNonnull())
No clue why this only shows up on AMD though.
In vtx_ass_from_pfcand there is the statement
if (pfcand.trackRef().isNonnull() && pv->trackWeight(pfcand.trackRef()) > 0.5 && pv_ass_quality == 7)
If the first clause fails, the others shall not be evaluated, so pv is dereferenced only for candidates that have a track.
Adding a cout before the call, I get this on Intel (lxplus806):
std::cout << ">>>> " << icand << ' ' << pv_ass_quality << ' ' << (reco_cand->trackRef().isNonnull() ? "okTk" : "noTk") << (pv_ass.isNonnull() ? "okPV" : "nullPV" )<< std::endl;
>>>> 0 6 okTkokPV
>>>> 1 0 noTknullPV
>>>> 2 2 okTkokPV
>>>> 3 0 noTknullPV
>>>> 4 7 okTkokPV
>>>> 5 6 okTkokPV
>>>> 6 2 okTkokPV
>>>> 7 6 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 2 okTkokPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 0 noTknullPV
>>>> 15 0 noTknullPV
>>>> 16 6 okTkokPV
>>>> 0 1 okTkokPV
>>>> 1 6 okTkokPV
>>>> 2 7 okTkokPV
>>>> 3 7 okTkokPV
>>>> 4 2 okTkokPV
>>>> 5 7 okTkokPV
>>>> 6 0 noTknullPV
>>>> 7 2 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 0 noTknullPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 2 okTkokPV
>>>> 15 6 okTkokPV
Each time there is a Tk there is a PV as well (and vice versa).
On AMD (lxplus800):
>>>> 0 6 okTkokPV
>>>> 1 0 noTknullPV
>>>> 2 2 okTkokPV
>>>> 3 0 noTknullPV
>>>> 4 7 okTkokPV
>>>> 5 6 okTkokPV
>>>> 6 2 okTkokPV
>>>> 7 6 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 2 okTkokPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 0 noTknullPV
>>>> 15 0 noTknullPV
>>>> 16 6 okTkokPV
>>>> 0 1 okTkokPV
>>>> 1 6 okTkokPV
>>>> 2 7 okTkokPV
>>>> 3 7 okTkokPV
>>>> 4 2 okTkokPV
>>>> 5 7 okTkokPV
>>>> 6 0 noTknullPV
>>>> 7 2 okTkokPV
>>>> 8 0 noTknullPV
>>>> 9 0 noTknullPV
>>>> 10 0 noTknullPV
>>>> 11 0 noTknullPV
>>>> 12 0 noTknullPV
>>>> 13 0 noTknullPV
>>>> 14 2 okTkokPV
>>>> 15 6 okTkokPV
>>>> 16 0 okTknullPV
----- Begin Fatal Exception 19-Jun-2024 15:13:26 CEST-----------------------
An exception of category 'InvalidReference' occurred while
[0] Processing Event run: 381443 lumi: 1038 event: 2226011497 stream: 0
[1] Running path 'dqmoffline_step'
[2] Prefetching for module ParticleNetJetTagMonitor/'particleNetAK8HbbTagMonitoring'
[3] Prefetching for module BTagProbabilityToDiscriminator/'pfParticleNetAK4DiscriminatorsJetTagsForRECO'
[4] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetAK4JetTagsForRECO'
[5] Calling method for module DeepBoostedJetTagInfoProducer/'pfParticleNetAK4TagInfosForRECO'
Exception Message:
BadRefCore RefCore: Request to resolve a null or invalid reference to a product of type 'std::vector<reco::Vertex>' has been detected.
Please modify the calling code to test validity before dereferencing.
So who is this 16th (actually 17th) candidate?
I printed the size of the vector, and indeed it is 16 on Intel and 17 on AMD... very fishy. This needs full debugging, as it should not be possible (I suspect a memory issue); valgrind may help.
The input jet seems to be different.
In the event there are 123 jets (sic). Jet 2 has 16 constituents on Intel and 17 on AMD; all others have the same number. A spurious pfCand, or a difference in the jet algorithm?
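A comparison like the one above can be automated once the per-jet constituent counts from the two platforms have been collected. This is a minimal sketch, not the procedure actually used in this thread: the function name `differing_jets` is hypothetical, and only the jet 2 counts (16 vs 17) come from the discussion; the other entries are placeholders.

```python
def differing_jets(counts_a, counts_b):
    """Return (jet_index, n_a, n_b) for jets whose constituent counts differ.

    A jet missing on one side is reported with None for that side.
    """
    diffs = []
    for idx in sorted(set(counts_a) | set(counts_b)):
        n_a = counts_a.get(idx)
        n_b = counts_b.get(idx)
        if n_a != n_b:
            diffs.append((idx, n_a, n_b))
    return diffs


# Illustrative input: only the jet 2 counts (16 vs 17) come from the thread;
# the other entries are made-up placeholders.
intel = {0: 21, 1: 17, 2: 16, 3: 9}
amd = {0: 21, 1: 17, 2: 17, 3: 9}
print(differing_jets(intel, amd))  # prints [(2, 16, 17)]
```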
Anyhow, this is the protection I suggest adding:
diff --git a/RecoBTag/FeatureTools/src/deep_helpers.cc b/RecoBTag/FeatureTools/src/deep_helpers.cc
index 76b443542b3..faf1649d9b8 100644
--- a/RecoBTag/FeatureTools/src/deep_helpers.cc
+++ b/RecoBTag/FeatureTools/src/deep_helpers.cc
@@ -150,7 +150,7 @@ namespace btagbtvdeep {
float vtx_ass_from_pfcand(const reco::PFCandidate &pfcand, int pv_ass_quality, const reco::VertexRef &pv) {
float vtx_ass = pat::PackedCandidate::PVAssociationQuality(qualityMap[pv_ass_quality]);
- if (pfcand.trackRef().isNonnull() && pv->trackWeight(pfcand.trackRef()) > 0.5 && pv_ass_quality == 7) {
+ if (pv.isNonnull() && pfcand.trackRef().isNonnull() && pv->trackWeight(pfcand.trackRef()) > 0.5 && pv_ass_quality == 7) {
vtx_ass = pat::PackedCandidate::UsedInFitTight;
}
return vtx_ass;
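The effect of the extra `pv.isNonnull()` clause can be illustrated with a toy model. The sketch below is a Python analogy, not CMSSW code: `Ref`, `Vertex` and both functions are hypothetical stand-ins, and Python's `and` is used in place of the C++ `&&`, which short-circuits in the same left-to-right way.

```python
class Ref:
    """Toy stand-in for an edm::Ref: wraps an object or nothing."""

    def __init__(self, target=None):
        self._target = target

    def is_nonnull(self):
        return self._target is not None

    def get(self):
        # Mimics the framework refusing to resolve a null reference.
        if self._target is None:
            raise RuntimeError("InvalidReference: null reference dereferenced")
        return self._target


class Vertex:
    """Toy vertex holding a fixed weight per track."""

    def __init__(self, weights):
        self._weights = weights

    def track_weight(self, track):
        return self._weights.get(track, 0.0)


def used_in_fit_unguarded(track_ref, pv_ref, quality):
    # Original ordering: only the track reference is tested before pv is dereferenced.
    return (track_ref.is_nonnull()
            and pv_ref.get().track_weight(track_ref.get()) > 0.5
            and quality == 7)


def used_in_fit_guarded(track_ref, pv_ref, quality):
    # Patched ordering: a null pv short-circuits the expression before any dereference.
    return (pv_ref.is_nonnull() and track_ref.is_nonnull()
            and pv_ref.get().track_weight(track_ref.get()) > 0.5
            and quality == 7)


# The problematic candidate from the AMD printout: a track but no vertex, quality 0.
track, null_pv = Ref("trk"), Ref()
print(used_in_fit_guarded(track, null_pv, 0))
try:
    used_in_fit_unguarded(track, null_pv, 0)
except RuntimeError as err:
    print(err)
```

In this model the unguarded version is safe only as long as every candidate with a track also has a vertex, which is the pattern seen in the Intel printout.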
Of course there is plenty of room for optimization more or less everywhere.
The input file is no longer there:
Failed to open the file 'root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024E/ParkingSingleMuon4/RAW/v1/000/381/443/00000/95d48fcb-0633-415c-a5cf-f2caeebab628.root?eos.app=cmst0'
Is there a way to recover the input file? I would really like to better understand the origin of the difference between AMD and Intel.
> Is there a way to recover the input file? I would really like to better understand the origin of the difference between AMD and Intel.
@germanfgv @LinaresToine please comment.
Should now be available at
/eos/cms/store/data/Run2024E/ParkingSingleMuon4/RAW/v1/000/381/443/00000/95d48fcb-0633-415c-a5cf-f2caeebab628.root
On AMD, the generalTracks collection has one more track than in the Intel case, and the track has the following properties:
pt=0.0130999 eta=-3.36499 phi=-0.951959 ptError=0.0195098 dzError=-nan
In principle we have removed all "raw" Ofast flags that could produce a difference. Maybe it is TensorFlow. I would tag this issue tracking-pog @slava77
> In principle we have removed all "raw" Ofast flags that could produce a difference.
As I recall the evidence was that there are fewer differences between AMD and Intel; there was no evidence that the results become identical.
> pt=0.0130999 eta=-3.36499 phi=-0.951959 ptError=0.0195098 dzError=-nan
Is covariance(i_dsz, i_dsz) also nan or is it negative?
> pt=0.0130999 eta=-3.36499 phi=-0.951959 ptError=0.0195098 dzError=-nan
> Is covariance(i_dsz, i_dsz) also nan or is it negative?
It is negative. Patch in [*] and output below.
XXX pt=0.0130999 eta=-3.36499 phi=-0.951959 dzError=-nan vtxIdMinSignif=-1 covariance(4, 4)=-0.281146
[*]
diff --git a/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc b/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc
index fad6b30333b..05042d01cca 100644
--- a/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc
+++ b/CommonTools/RecoAlgos/src/PrimaryVertexAssignment.cc
@@ -5,6 +5,7 @@
#include "DataFormats/Math/interface/deltaR.h"
#include "TrackingTools/IPTools/interface/IPTools.h"
#include "FWCore/Utilities/interface/isFinite.h"
+#include "FWCore/MessageLogger/interface/MessageLogger.h"
std::pair<int, PrimaryVertexAssignment::Quality> PrimaryVertexAssignment::chargedHadronVertex(
const reco::VertexCollection& vertices,
@@ -184,6 +185,10 @@ std::pair<int, PrimaryVertexAssignment::Quality> PrimaryVertexAssignment::charge
// all other tracks could be non-B secondaries and we just attach them with closest Z
if (vtxIdMinSignif >= 0)
return {vtxIdMinSignif, PrimaryVertexAssignment::OtherDz};
+
+edm::LogPrint("AAAA") << "XXX pt=" << track->pt() << " eta=" << track->eta() << " phi=" << track->phi() << " dzError=" << track->dzError() << " vtxIdMinSignif=" << vtxIdMinSignif
+<< " covariance(4, 4)=" << track->covariance(4, 4);
+
//If for some reason even the dz failed (when?) we consider the track not assigned
return {-1, PrimaryVertexAssignment::Unassigned};
}
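The negative covariance(4, 4) reported above would explain the `-nan`, assuming (as the question about covariance(i_dsz, i_dsz) suggests) that dzError is derived from the square root of that diagonal element. The sketch below models this; the function names are hypothetical, and the NaN is produced explicitly because Python's `math.sqrt` raises `ValueError` on negative input, whereas the C/C++ `sqrt` returns NaN.

```python
import math


def c_like_sqrt(x):
    """Square root with C/C++ semantics: NaN for negative input.

    Python's math.sqrt raises ValueError instead, so the NaN is modelled explicitly.
    """
    return math.sqrt(x) if x >= 0 else float("nan")


def has_finite_error(variance):
    """True if the error derived from this variance is a usable finite number."""
    return math.isfinite(c_like_sqrt(variance))


variance = -0.281146  # covariance(4, 4) value reported in the thread
print(c_like_sqrt(variance))       # prints nan
print(has_finite_error(variance))  # prints False
```

Any comparison against a NaN significance evaluates to false, which would be consistent with vtxIdMinSignif staying at -1 for this track.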
Why only on AMD? (Or better: why is the track not there at all on Intel?)
type tracking
It seems we now also have a different failure that occurs only on AMD: https://github.com/cms-sw/cmssw/issues/45398 Just cross-posting it here.
@gpetruc This issue seems to be fixed with this PR. Please close it.
Alternatively @cms-sw/reconstruction-l2 could sign the issue
+q