HLT crash in run 359998: Unavailable Conditions of type HcalChannelQuality
Crash in Run 359998 http://cmsonline.cern.ch/cms-elog/1159020
with the following message:
[2] Calling method for module CaloTowersCreator/'hltTowerMakerForAll'
Exception Message:
Unavailable Conditions of type HcalChannelQuality for cell (0x0)
Unfortunately not reproducible yet. The file reconverted to ROOT is
/nfshome0/hltpro/hilton_c2e36_35_04/hltpro/thiagoScratch/run359998_ls0335.root,
and the relevant configurations are:
- CMSSW_12_4_9
- GT: 124X_dataRun3_HLT_v4
- /cdaq/physics/Run2022/2e34/v1.4.0/HLT/V10
A copy of the configuration file is available in /nfshome0/hltpro/hilton_c2e36_35_04/hltpro/thiagoScratch/hlt.py
A new Issue was created by @trtomei Thiago Tomei.
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
From the AlCa point of view, I can confirm that the GT 124X_dataRun3_HLT_v4 has been online for a while now and the tag HcalChannelQuality_v2.0_hlt was last modified on 2022-08-03, so I don't see any clear reason for this failure.
Maybe someone from @cms-sw/hcal-dpg-l2 can comment on the cell (0x0)?
assign hcal-dpg
FYI: @cms-sw/hlt-l2 @silviodonato
New categories assigned: hcal-dpg
@wang-hui,@georgia14,@igv4321 you have been requested to review this Pull request/Issue and eventually sign? Thanks
For the record, this online crash happened more than once (and it does not seem to be reproducible offline). Affected runs (afaik):
357898
359998
@trtomei , please update the title of the issue with something like "HLT crash in run 359998: ...".
Hi @trtomei, could you please copy the config file to lxplus so that we in HCAL DPG can try to reproduce the crash?
Hi @wang-hui The files are available in /afs/cern.ch/user/t/tomei/public/issue39693 now!
Hello @cms-sw/hcal-dpg-l2 @cms-sw/alca-l2 , this crash happened again this night in run 360295. HLT was using CMSSW_12_4_10.
The crash happened 5 times (2022-10-13):
- 1 time at 09:16:07 (fu-c2b03-33-01)
- 1 time at 08:27:12 (fu-c2b03-12-01)
- 3 times around 05:40
  - 05:39:07 (fu-c2b03-33-01)
  - 05:41:26 (fu-c2b03-30-01)
  - 05:44:57 (fu-c2b02-03-01)
f3mon_logtable_2022-10-13T07_53_34.976Z.txt
List of runs with the crashes:
357898
359998
360295
In Run 360330
[2] Calling method for module CaloTowersCreator/'hltStoppedHSCPTowerMakerForAll'
Exception Message:
Requested conditions of type HcalChannelQuality for cell (0x45104408) (HE -17,8,1) got conditions for cell (0x0)
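The second form of the message reads as if a lookup returned an entry whose stored id (0x0) differs from the requested cell. The following is a minimal C++ sketch of that kind of consistency check. All names (`ChannelStatus`, `QualityTable`, `lookup`) are hypothetical stand-ins and not the CMSSW condition containers; the dense-slot layout is an assumption made only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one channel-quality entry (not the CMSSW class).
struct ChannelStatus {
  std::uint32_t rawId = 0;   // 0 means "default-constructed, never filled"
  std::uint32_t status = 0;
};

// Hypothetical container indexed by a dense slot number, so a slot can exist
// while still holding a default (rawId == 0) entry.
class QualityTable {
public:
  explicit QualityTable(std::size_t nSlots) : slots_(nSlots) {}

  void set(std::size_t slot, ChannelStatus s) { slots_.at(slot) = s; }

  // Returns the entry only if it really belongs to the requested cell.
  const ChannelStatus& lookup(std::size_t slot, std::uint32_t requestedId) const {
    const ChannelStatus& found = slots_.at(slot);
    if (found.rawId != requestedId) {
      std::ostringstream msg;
      msg << "Requested conditions for cell (0x" << std::hex << requestedId
          << ") got conditions for cell (0x" << found.rawId << ")";
      throw std::runtime_error(msg.str());
    }
    return found;
  }

private:
  std::vector<ChannelStatus> slots_;
};
```

With this sketch, asking for a cell whose slot was never filled throws with a message ending in "got conditions for cell (0x0)", which mirrors the shape of the error above without claiming to reproduce its actual cause.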
This particular event was investigated by @wang-hui. The offline CPU code gives the following warning:
[1] %MSG-w HBHEDigi: HBHEPhase1Reconstructor:hltHbherecoLegacy 13-Oct-2022 21:36:29 CEST Run: 359998 Event: 538274608
bad SOI/maxTS in cell (HB 10,47,3)
expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
got maxTS = 8, SOI = -1
see here https://github.com/cms-sw/cmssw/blob/master/RecoLocalCalo/HcalRecProducers/src/HBHEPhase1Reconstructor.cc#L520
There is a shift in the SOI. I do not see this check in the HLT GPU code, which is why it crashes on this event. I will implement a fix today so that HLT will not crash.
Of course, we need to understand why the electronics thinks this rechit is shifted by 25 ns!
Hi, where can I find these events? It would be good to have them copied somewhere so that we can classify all these exceptions.
They are available on the online GPU-development machines, e.g. gpu-c2a02-35-01.cms, at
/store/error_stream/run{357898,359998,360295}/*raw
For an example of how to rerun HLT directly on *.raw files, see https://github.com/cms-sw/cmssw/issues/39045#issuecomment-1214193410.
I do not see this condition in the hlt-gpu code so that's why crash on this.
FYI: @cms-sw/heterogeneous-l2
Seems to be happening a lot more frequently in recent runs:
Run number: 360330
L1/HLT key: collisions2022/v249
HLT Menu: /cdaq/physics/Run2022/2e34/v1.4.3/HLT/V1
CMSSW version: CMSSW_12_4_10
Most (if not all) of the crashes have the message:
[2] Calling method for module CaloTowersCreator/'hltTowerMakerForAll'
Exception Message:
Requested conditions of type HcalChannelQuality for cell (0x45104407) (HE -17,7,1) got conditions for cell (0x0)
assign heterogeneous
New categories assigned: heterogeneous
@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
Seems to be happening a lot more frequently in recent runs:
720 crashes in run 360330
Run number: 360330
L1/HLT key: collisions2022/v249
HLT Menu: /cdaq/physics/Run2022/2e34/v1.4.3/HLT/V1
CMSSW version: CMSSW_12_4_10
Most (if not all) of the crashes have the message:
[2] Calling method for module CaloTowersCreator/'hltTowerMakerForAll'
Exception Message:
Requested conditions of type HcalChannelQuality for cell (0x45104407) (HE -17,7,1) got conditions for cell (0x0)
Just to add a bit of information on "recent runs":
This morning there was an update of the HCAL conditions, with a few hiccups, exactly in the run range 360329–360333
(as described in this CMSTalk post), which consequently caused some processing issues at Tier0 (see this CMSTalk post).
Since the crashes reported in this GH issue pre-date the errors I just described, I think the two things might be unrelated, but I wanted to add the information for completeness.
One thing that I can reproduce is that the soi (what is it?) computed on the GPU is "wrong" for the same event as on the CPU:
Begin processing the 18th record. Run 359998, Event 538274608, LumiSection 335 on stream 0 at 14-Oct-2022 17:38:13.244 CEST
%MSG-w HBHEDigi: HBHEPhase1Reconstructor:hltHbherecoLegacy 14-Oct-2022 17:38:13 CEST Run: 359998 Event: 538274608
bad SOI/maxTS in cell (HB 10,47,3)
expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
got maxTS = 8, SOI = -1
%MSG
and
Begin processing the 18th record. Run 359998, Event 538274608, LumiSection 335 on stream 0 at 14-Oct-2022 17:39:00.806 CEST
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
----- Begin Fatal Exception 14-Oct-2022 17:39:00 CEST-----------------------
An exception of category 'Conditions not found' occurred while
[0] Processing Event run: 359998 lumi: 335 event: 538274608 stream: 0
[1] Running path 'Path'
[2] Calling method for module CaloTowersCreator/'hltTowerMakerForAll'
Exception Message:
Unavailable Conditions of type HcalChannelQuality for cell (0x0)
----- End Fatal Exception -------------------------------------------------
Looking at the legacy code in RecoLocalCalo/HcalRecProducers/src/HBHEPhase1Reconstructor.cc
- if the soi is bad, it prints a warning and sets a badSOI flag:
    const int soi = tsFromDB_ ? properties.paramTs->firstSample() : frame.presamples();
    const bool badSOI = !(maxTS >= 3 && soi > 0 && soi < maxTS - 1);
    if (badSOI) {
      edm::LogWarning("HBHEDigi") << " bad SOI/maxTS in cell " << cell
                                  << "\n expect maxTS >= 3 && soi > 0 && soi < maxTS - 1"
                                  << "\n got maxTS = " << maxTS << ", SOI = " << soi;
    }
- the badSOI flag is set in the channelInfo:
    channelInfo->setChannelInfo(cell, pulseShapeID, nTSToCopy, fitSoi, soiCapid, darkCurrent, fcByPE, lambda, noisecorr,
                                hwerr.first, hwerr.second, properties.taggedBadByDb || dropByZS || badSOI);
- the last argument of setChannelInfo sets the dropped_ flag, which is read via the bool isDropped() method
- which in turn makes the producer skip this channel:
    // If needed, add the channel info to the output collection
    const bool makeThisRechit = !channelInfo->isDropped();
    [...]
    // Reconstruct the rechit
    if (rechits && makeThisRechit) {
      [...]
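The chain described in the list above (bad SOI, dropped flag, skipped rechit) can be condensed into a small standalone C++ sketch. The types `DigiSketch` and `ChannelInfoSketch` and the function `channelsToReconstruct` are hypothetical simplifications introduced here, not the CMSSW classes; only the validity condition itself is taken from the warning message.

```cpp
#include <vector>

// Hypothetical, heavily simplified digi: just the fields this sketch needs.
struct DigiSketch {
  int cellId;
  int maxTS;           // number of time samples in the frame
  int soi;             // sample of interest as read from the frame or DB
  bool taggedBadByDb;
  bool dropByZS;
};

// Hypothetical stand-in for the per-channel info object.
struct ChannelInfoSketch {
  int cellId = 0;
  bool dropped = false;
  bool isDropped() const { return dropped; }
};

// Same validity condition as quoted in the warning message.
inline bool isBadSOI(int maxTS, int soi) {
  return !(maxTS >= 3 && soi > 0 && soi < maxTS - 1);
}

// Returns the ids of the channels for which a rechit would be made.
inline std::vector<int> channelsToReconstruct(const std::vector<DigiSketch>& digis) {
  std::vector<int> kept;
  for (const auto& d : digis) {
    ChannelInfoSketch info;
    info.cellId = d.cellId;
    // A bad SOI drops the channel just like a DB tag or zero suppression does.
    info.dropped = d.taggedBadByDb || d.dropByZS || isBadSOI(d.maxTS, d.soi);

    const bool makeThisRechit = !info.isDropped();
    if (makeThisRechit) {
      kept.push_back(info.cellId);
    }
  }
  return kept;
}
```

For the values in the log (maxTS = 8, SOI = -1), `isBadSOI` returns true, so that channel never reaches the rechit step in this sketch.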
Now the question is: how do we skip a "bad" channel in the GPU reconstruction?
By the way, a source of problems is that soiSamples was uninitialised, hence the random value -87.
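To illustrate why the uninitialised value matters, here is a minimal host-side C++ sketch, written under the assumption that the SOI is found by scanning per-sample flags. `SampleSketch`, `findSOI`, `shouldSkipChannel` and `kNoSOI` are hypothetical names, and this is not the code of the actual fix. The point is only that a sentinel initial value turns "no sample flagged" into a well-defined, testable state instead of a random number.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical time sample: an ADC value plus a "this is the SOI" flag.
struct SampleSketch {
  int adc;
  bool soiFlag;
};

// Sentinel meaning "no sample was flagged as SOI".
constexpr int kNoSOI = -1;

// Scans the samples for the SOI flag. The explicit initialisation is what
// keeps the result well defined when no sample carries the flag.
inline int findSOI(const std::vector<SampleSketch>& samples) {
  int soi = kNoSOI;
  for (std::size_t ts = 0; ts < samples.size(); ++ts) {
    if (samples[ts].soiFlag) {
      soi = static_cast<int>(ts);
    }
  }
  return soi;
}

// Applies the same validity condition as the legacy warning; the sentinel
// fails it automatically because it is not > 0.
inline bool shouldSkipChannel(const std::vector<SampleSketch>& samples) {
  const int maxTS = static_cast<int>(samples.size());
  const int soi = findSOI(samples);
  return !(maxTS >= 3 && soi > 0 && soi < maxTS - 1);
}
```

With the sentinel in place, a frame with eight samples and no flagged SOI yields `soi == -1` and is skipped, which matches the value reported by the legacy CPU warning.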
We discussed this issue in today's HCAL DPG meeting. Our OPS colleagues are investigating possible data corruption in the digi. Will let you know if they find something.
OK.
In the meantime, I've prepared what I think is a fix to skip the channels affected by this problem, trying to follow the same approach used in the legacy rechit reconstruction: https://github.com/cms-sw/cmssw/pull/39738 .
+heterogeneous
The same error has been reported in runs 360393 and 360400.
Running with the candidate fix from #39740 lets all HLT jobs complete, with some HCAL-related messages:
%MSG-w HBHEDigi: HBHEPhase1Reconstructor:hltHbherecoLegacy 15-Oct-2022 10:42:05 CEST Run: 360400 Event: 26202582
bad SOI/maxTS in cell (HB -12,28,1)
expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
got maxTS = 8, SOI = -1
%MSG
%MSG-w Invalid Data: HcalRawToDigi:hltHcalDigis 15-Oct-2022 10:42:45 CEST Run: 360400 Event: 44480955
The default QIE11 Collection has 8 samples per digi, while the current data has 17! This data cannot be included with the default collection.
In order to store this data in the event, it must have a unique tag. To accomplish this, provide two lists to HcalRawToDigi
1) that specifies the number of samples and 2) that gives tags with which these data are saved.
For example in this case you might add
process.hcalDigis.saveQIE11DataNSamples = cms.untracked.vint32( 17)
process.hcalDigis.saveQIE11DataTags = cms.untracked.vstring( "MYDATA" )
%MSG
%MSG-w HBHEDigi: HBHEPhase1Reconstructor:hltHbherecoLegacy 15-Oct-2022 10:43:04 CEST Run: 360400 Event: 136235629
bad SOI/maxTS in cell (HB 11,27,1)
expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
got maxTS = 8, SOI = -1
%MSG
%MSG-w HBHEDigi: HBHEPhase1Reconstructor:hltHbherecoLegacy 15-Oct-2022 10:43:04 CEST Run: 360400 Event: 136235629
bad SOI/maxTS in cell (HB -9,30,3)
expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
got maxTS = 8, SOI = -1
%MSG
@cms-sw/hcal-dpg-l2 please consider signing this issue. @trtomei please consider closing this issue (as per today's joint ops document no more issues of this type were noticed in recent runs)
Hi @mmusich, the patch for this issue has been merged in #39738. We in HCAL DPG are happy with the patch.
@wang-hui then please sign-off this issue. Thanks.
+1
This issue is fully signed and ready to be closed.