Exception on HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1 Phase-2 workflow
I just saw relval failures [1] in the workflows [2]. I am not sure whether this is linked to the DeepTauId issues reported before in https://github.com/cms-sw/cmssw/issues/40437 and https://github.com/cms-sw/cmssw/issues/40733, as the error report looks different. I tried to look at the 13_1_0_preX relval reports, but I don't see this issue there, so it is not clear to me under which conditions this exception happens.
[1]
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 1 lumi: 13 event: 648 stream: 0
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,86] vs. [207]
[[{{node inner_egamma_norm_1/FusedBatchNorm_1/Mul}}]]
[2] https://cmsweb.cern.ch/couchdb/workloadsummary/_design/WorkloadSummary/_show/histogramByWorkflow/pdmvserv_RVCMSSW_13_3_0_pre2ZpToMM_m6000_14TeV__2026D98noPU_RV213_230913_122713_9881 https://cmsweb.cern.ch/couchdb/workloadsummary/_design/WorkloadSummary/_show/histogramByWorkflow/pdmvserv_RVCMSSW_13_3_0_pre2TenTau_15_500_Eta3p1__2026D98noPU_RV213_230913_122651_78
A new Issue was created by @srimanob Phat Srimanobhas.
@Dr15Jones, @rappoccio, @smuzaffar, @makortel, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
FYI @mmusich @rovere @SohamBhattacharya Not sure if this is a known issue on your side, but I see it happening only in the 13_3 relvals, not in 13_1 which you are focusing on.
assign upgrade,hlt
New categories assigned: upgrade,hlt
@AdrianoDee,@mmusich,@missirol,@srimanob,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks
@srimanob thanks for reporting.
Fwiw, the path HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1 was introduced in master (CMSSW_13_3_X) in PR https://github.com/cms-sw/cmssw/pull/42562 and backported to CMSSW_13_1_X in PR https://github.com/cms-sw/cmssw/pull/42649.
@hsert FYI
Shouldn't one use year=2026 instead of 2017?
https://github.com/cms-sw/cmssw/blob/a288c8fdd1cbc6f2e0759f8ebbaa9763c76a7e70/HLTrigger/Configuration/python/HLT_75e33/modules/hltHpsPFTauDeepTauProducer_cfi.py#L14
and also deepTau_2026v2p5_core.pb etc files in here? https://github.com/cms-sw/cmssw/blob/a288c8fdd1cbc6f2e0759f8ebbaa9763c76a7e70/HLTrigger/Configuration/python/HLT_75e33/modules/hltHpsPFTauDeepTauProducer_cfi.py#L10-L12
I see this is also there in the 13_1 backport. Maybe @hsert can clarify.
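For illustration only, a minimal sketch of what the suggested change could look like in hltHpsPFTauDeepTauProducer_cfi.py, assuming the fragment exposes a graph_file list and a year parameter similar to the offline DeepTauId configuration (the parameter types and the inner/outer file names below are assumptions, not copied from the actual file):

import FWCore.ParameterSet.Config as cms

# Hypothetical excerpt of the producer fragment, switched to the Phase-2 (2026) training.
# Only the lines relevant to the suggestion are shown; all other parameters stay unchanged.
hltHpsPFTauDeepTauProducer = cms.EDProducer('DeepTauId',
    graph_file = cms.vstring(
        'RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2026v2p5_core.pb',
        'RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2026v2p5_inner.pb',
        'RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2026v2p5_outer.pb',
    ),
    year = cms.uint32(2026),  # instead of 2017
)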
I see this is also there in the 13_1 backport.
if this path fires seldom enough it might be randomly failing in a 9k events relval.
Yes, that is correct. It should be 2026. It was integrated in this way to have a preliminary version of the path. The path still requires some improvements and this is one of them. Once we understand the performance issue that we observed, we will work on updating to a new/phase2 DeepTau training.
@hsert
Once we understand the performance issue that we observed, we will work on updating to a new/phase2 DeepTau training.
No. We can't let this randomly fail relvals (used by everyone else). Please either fix or disable the path.
I see this is also there in the 13_1 backport.
if this path fires seldom enough it might be randomly failing in a 9k events relval.
We don't see the failure in 13_1 because the last relvals were run with CMSSW_13_1_0_pre4, while the backport PR was only merged in August, so the path is not exercised there. I agree with @mmusich: if we can't fix it at this moment, we should disable the path.
Ok, how urgent is it? I was working on updating it because of the fixes needed, then focused on the performance improvement since that was suggested in the Phase-2 meeting. It may take some days to fix. If it is urgent, then we can disable it.
To me, the next pre-release, CMSSW_13_3_0_pre4, is targeted for 2023/10/17. So if you can't fix it by, let's say, a week before that (10 Oct), then please make a PR to disable it. Others may have different comments.
To me, the next pre-release, CMSSW_13_3_0_pre4, is targeted for 2023/10/17. So if you can't fix it by, let's say, a week before that (10 Oct), then please make a PR to disable it. Others may have different comments.
Agreed, the path should be disabled for the time being if it crashes RelVals. @hsert If there's no quick fix for this, can you please disable the DeepTau path?
If there's no quick fix for this, can you please disable the DeepTau path?
before disabling the path, I'd just like to make sure that the fix proposed at https://github.com/cms-sw/cmssw/issues/42862#issuecomment-1735044691 is not enough.
As shown in slide 18 of https://indico.cern.ch/event/1322372/#sc-2-5-taus, it gives a matrix-incompatibility issue. I have another deadline this week, but I can focus on solving the issue next week if it is ok. If that is too late, we can disable it. Is there anything that I should do for disabling the path?
Is there anything that I should do for disabling the path?
removing any reference from the menu (while keeping the configuration fragment) should be enough.
EDIT: to be fully explicit, remove it from here:
https://github.com/cms-sw/cmssw/blob/a288c8fdd1cbc6f2e0759f8ebbaa9763c76a7e70/HLTrigger/Configuration/python/HLT_75e33_cff.py#L293
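Purely as an illustration of what "removing any reference from the menu" means (this is not the actual fix, which edits the cff above directly), a hedged sketch of a customisation function that drops the path from an already-assembled process; the function name is invented here:

# Hypothetical customisation, e.g. to be applied via --customise in cmsDriver,
# that removes the path while leaving its configuration fragment in the release.
def customiseToRemoveDeepTauPath(process):
    pathName = 'HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1'
    if hasattr(process, pathName):
        path = getattr(process, pathName)
        if process.schedule is not None:
            process.schedule.remove(path)  # take it out of the schedule, if one is defined
        delattr(process, pathName)  # and drop the path object itself
    return process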
I have another deadline this week, but I can focus on solving the issue next week if it is ok
Next week should be fine. I'll check back here in one week time.
Is there anything that I should do for disabling the path?
removing any reference from the menu (while keeping the configuration fragment) should be enough.
EDIT: to be fully explicit, remove it from here:
https://github.com/cms-sw/cmssw/blob/a288c8fdd1cbc6f2e0759f8ebbaa9763c76a7e70/HLTrigger/Configuration/python/HLT_75e33_cff.py#L293
Confirm that it's enough to remove it from HLT_75e33_cff.py. The Tau paths are intentionally not there in HLT_75e33_cff_timing.py as that's meant to measure a timing that can be compared to the TDR measurement, which does not include the tau paths.
Is there anything that I should do for disabling the path?
removing any reference from the menu (while keeping the configuration fragment) should be enough. EDIT: to be fully explicit, remove it from here: https://github.com/cms-sw/cmssw/blob/a288c8fdd1cbc6f2e0759f8ebbaa9763c76a7e70/HLTrigger/Configuration/python/HLT_75e33_cff.py#L293
Confirm that it's enough to remove it from HLT_75e33_cff.py. The Tau paths are intentionally not there in HLT_75e33_cff_timing.py as that's meant to measure a timing that can be compared to the TDR measurement, which does not include the tau paths.
Once the performance of the Tau paths is better understood, we can hook them into the Timing version to have an improved baseline.
@hsert please update on the status of the fix. Thank you.
I have tried several things, but couldn't figure it out yet. Probably it would be safer to disable the path. Sorry for the inconvenience.
I have tried several things, but couldn't figure it out yet. Probably it would be safer to disable the path. Sorry for the inconvenience.
I have created https://github.com/cms-sw/cmssw/pull/42955, just in case. I am trying (unsuccessfully so far) to reproduce the failure offline (@srimanob it would be good if you could get the specs of the nodes on which the relval jobs failed, to see if there's any architecture dependence). If nothing better appears before the next pre-release deadline, we can go ahead with disabling the path.
Hi @cms-sw/pdmv-l2, could you please help to find the specs of the node that ran these relvals? https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_13_3_0_pre3DisplacedSUSY_stopToB_M_800_500mm_13__2026D98noPU_230918_064200_5200
Thx.
I can reproduce the issue with CMSSW_13_3_0_pre3 using slc7_amd64_gcc11 arch. Not tested yet with the default one, el8.
I managed to keep the RAW file which causes the issue in the HLT step. So, to reproduce the issue, you can just rerun HLT on top of the RAW file. Here is my cmsDriver command:
cmsDriver.py step3 -s HLT:@relval2026 --conditions auto:phase2_realistic_T25 --datatier GEN-SIM-DIGI-RAW -n -1 --eventcontent FEVTDEBUGHLT --geometry Extended2026D98 --era Phase2C17I13M9 --python step_3_cfg.py --no_exec --filein file:/eos/cms/store/group/offcomp_upgrade-sw/srimanob/phase2/HLT/Tau/step2.root --fileout file:step3.root --customise SLHCUpgradeSimulations/Configuration/aging.customise_aging_1000
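Not something done in this thread, but as a hypothetical shortcut: once step_3_cfg.py has been generated by the command above, one could restrict the job to a single suspect event to speed up reproduction, e.g. by appending the lines below to the config. The run:lumi:event value is just the one quoted in the original exception and assumes that event is present in the kept RAW file.

# Hypothetical addition at the end of step_3_cfg.py: process only one event,
# here run 1, lumi 13, event 648 as quoted in the relval exception above.
import FWCore.ParameterSet.Config as cms
process.source.eventsToProcess = cms.untracked.VEventRange('1:13:648-1:13:648')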
With CMSSW_13_3_0_pre3, I see more failures in relvals, e.g. with CMSSW_13_3_0_pre3DisplacedSUSY, 3 out of 20 jobs fail with this issue.
I can reproduce the issue with CMSSW_13_3_0_pre3 using slc7_amd64_gcc11 arch.
I can reproduce the issue in these conditions. Interestingly, when I run in CMSSW_13_3_X_2023-10-05-1100 on el8_amd64_gcc11, it does NOT reproduce.
Even more interestingly, at the beginning of the job there is a set of TensorFlow warnings which differs between the two architectures:
On el8_amd64_gcc11:
2023-10-05 17:25:29.050172: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-05 17:25:29.050519: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: UNKNOWN ERROR (34)
2023-10-05 17:25:29.070838: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
On slc7_amd64_gcc11 arch:
2023-10-05 17:29:55.090388: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-05 17:30:19.307018: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-05 17:30:19.308508: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: UNKNOWN ERROR (34)
2023-10-05 17:30:19.352316: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
this is indeed slightly reminiscent of https://github.com/cms-sw/cmssw/issues/40437 and https://github.com/cms-sw/cmssw/issues/40733.
@cms-sw/tau-pog-l2 and @VinInn FYI.
Here is what I get from el8, with CMSSW_13_3_0_pre3:
2023-10-05 17:34:42.609372: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-05 17:34:42.609932: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: UNKNOWN ERROR (34)
2023-10-05 17:34:42.648562: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
and the same issue appears:
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]
Here is what I get from el8, with CMSSW_13_3_0_pre3:
For me, this combination also runs OK on lxplus806. Not sure how to tell the cpu microarchitecture from the lxplus node.
Not sure how to tell the cpu microarchitecture from the lxplus node.
OK, for the record it can be obtained by running cmsRun -e step_3_cfg.py and then inspecting the FrameworkJobReport.xml.
On lxplus806, in which the job succeeds (in CMSSW_13_3_0_pre3 with el8_amd64_gcc11):
<Metric Name="CPUModels" Value="Intel Core Processor (Broadwell, IBRS)"/>
On lxplus761, in which the job fails (in CMSSW_13_3_0_pre3 with slc7_amd64_gcc11):
<Metric Name="CPUModels" Value="Intel Xeon Processor (Skylake, IBRS)"/>