Exception on HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1 Phase-2 workflow
I just saw relval failures [1] in the workflows [2]. I am not sure whether this is linked to the DeepTauId issues reported before in https://github.com/cms-sw/cmssw/issues/40437 and https://github.com/cms-sw/cmssw/issues/40733, as the error report looks different. I tried to look at the 13_1_0_preX relval reports, but I don't see this issue there, so it is not clear to me under which conditions this exception happens.
[1]
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 1 lumi: 13 event: 648 stream: 0
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,86] vs. [207]
[[{{node inner_egamma_norm_1/FusedBatchNorm_1/Mul}}]]
[2] https://cmsweb.cern.ch/couchdb/workloadsummary/_design/WorkloadSummary/_show/histogramByWorkflow/pdmvserv_RVCMSSW_13_3_0_pre2ZpToMM_m6000_14TeV__2026D98noPU_RV213_230913_122713_9881 https://cmsweb.cern.ch/couchdb/workloadsummary/_design/WorkloadSummary/_show/histogramByWorkflow/pdmvserv_RVCMSSW_13_3_0_pre2TenTau_15_500_Eta3p1__2026D98noPU_RV213_230913_122651_78
A new Issue was created by @srimanob Phat Srimanobhas.
@Dr15Jones, @rappoccio, @smuzaffar, @makortel, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
FYI @mmusich @rovere @SohamBhattacharya Not sure if this is a known issue on your side, but I see it happening only in the 13_3 relvals, not in 13_1 which you are focusing on.
assign upgrade,hlt
New categories assigned: upgrade,hlt
@AdrianoDee,@mmusich,@missirol,@srimanob,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks
@srimanob thanks for reporting.
Fwiw, the path HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1 was introduced in master (CMSSW_13_3_X) in PR https://github.com/cms-sw/cmssw/pull/42562 and backported to CMSSW_13_1_X in PR https://github.com/cms-sw/cmssw/pull/42649.
@hsert FYI
Shouldn't one use year=2026 instead of 2017?
https://github.com/cms-sw/cmssw/blob/a288c8fdd1cbc6f2e0759f8ebbaa9763c76a7e70/HLTrigger/Configuration/python/HLT_75e33/modules/hltHpsPFTauDeepTauProducer_cfi.py#L14
and also deepTau_2026v2p5_core.pb etc files in here? https://github.com/cms-sw/cmssw/blob/a288c8fdd1cbc6f2e0759f8ebbaa9763c76a7e70/HLTrigger/Configuration/python/HLT_75e33/modules/hltHpsPFTauDeepTauProducer_cfi.py#L10-L12
I see this is also there in the 13_1 backport. Maybe @hsert can clarify.
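For illustration only, a minimal sketch of what the suggested change could look like in hltHpsPFTauDeepTauProducer_cfi.py, assuming the fragment exposes a graph_file list and a year parameter similar to the offline DeepTauId configuration (the parameter types and the inner/outer file names below are assumptions, not copied from the actual file):

import FWCore.ParameterSet.Config as cms

# Hypothetical excerpt of the producer fragment, switched to the Phase-2 (2026) training.
# Only the lines relevant to the suggestion are shown; all other parameters stay unchanged.
hltHpsPFTauDeepTauProducer = cms.EDProducer('DeepTauId',
    graph_file = cms.vstring(
        'RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2026v2p5_core.pb',
        'RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2026v2p5_inner.pb',
        'RecoTauTag/TrainingFiles/data/DeepTauId/deepTau_2026v2p5_outer.pb',
    ),
    year = cms.uint32(2026),  # instead of 2017
)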
I see this is also there in the 13_1 backport.
if this path fires seldom enough it might be randomly failing in a 9k events relval.
Yes, that is correct. It should be 2026. It was integrated in this way to have a preliminary version of the path. The path still requires some improvements and this is one of them. Once we understand the performance issue that we observed, we will work on updating to a new/phase2 DeepTau training.
@hsert
Once we understand the performance issue that we observed, we will work on updating to a new/phase2 DeepTau training.
No. We can't let this randomly fail relvals (used by everyone else). Please either fix or disable the path.
I see this is also there in the 13_1 backport.
if this path fires seldom enough it might be randomly failing in a 9k events relval.
We don't see the failure in 13_1 because the last relvals were run with CMSSW_13_1_0_pre4, while the backport PR was only merged in August, so the path is not exercised there. I agree with @mmusich: if we can't fix it at this moment, we should disable the path.
Ok, how urgent is it? I was working on updating it because of the fixes needed, then focused on the performance improvement since that was suggested in the Phase-2 meeting. It may take some days to fix. If it is urgent, then we can disable it.
To me, the next pre-release, CMSSW_13_3_0_pre4, is targeted for 2023/10/17. So if you can't fix it by, let's say, a week before that (10 Oct), then please make a PR to disable it. Others may have different comments.
To me, the next pre-release, CMSSW_13_3_0_pre4, is targeted for 2023/10/17. So if you can't fix it by, let's say, a week before that (10 Oct), then please make a PR to disable it. Others may have different comments.
Agreed, the path should be disabled for the time being if it crashes RelVals. @hsert If there's no quick fix for this, can you please disable the DeepTau path?
If there's no quick fix for this, can you please disable the DeepTau path?
before disabling the path, I'd just like to make sure that the fix proposed at https://github.com/cms-sw/cmssw/issues/42862#issuecomment-1735044691 is not enough.
As shown in slide 18 of https://indico.cern.ch/event/1322372/#sc-2-5-taus, it gives a matrix-incompatibility issue. I have another deadline this week, but I can focus on solving the issue next week if it is ok. If that is too late, we can disable it. Is there anything that I should do for disabling the path?
Is there anything that I should do for disabling the path?
removing any reference from the menu (while keeping the configuration fragment) should be enough.
EDIT: to be fully explicit, remove it from here:
https://github.com/cms-sw/cmssw/blob/a288c8fdd1cbc6f2e0759f8ebbaa9763c76a7e70/HLTrigger/Configuration/python/HLT_75e33_cff.py#L293
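Purely as an illustration of what "removing any reference from the menu" means (this is not the actual fix, which edits the cff above directly), a hedged sketch of a customisation function that drops the path from an already-assembled process; the function name is invented here:

# Hypothetical customisation, e.g. to be applied via --customise in cmsDriver,
# that removes the path while leaving its configuration fragment in the release.
def customiseToRemoveDeepTauPath(process):
    pathName = 'HLT_DoubleMediumDeepTauPFTauHPS35_eta2p1'
    if hasattr(process, pathName):
        path = getattr(process, pathName)
        if process.schedule is not None:
            process.schedule.remove(path)  # take it out of the schedule, if one is defined
        delattr(process, pathName)  # and drop the path object itself
    return process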
I have another deadline this week, but I can focus on solving the issue next week if it is ok
Next week should be fine. I'll check back here in one week time.
Is there anything that I should do for disabling the path?
removing any reference from the menu (while keeping the configuration fragment) should be enough.
EDIT: to be fully explicit, remove it from here:
https://github.com/cms-sw/cmssw/blob/a288c8fdd1cbc6f2e0759f8ebbaa9763c76a7e70/HLTrigger/Configuration/python/HLT_75e33_cff.py#L293
Confirm that it's enough to remove it from HLT_75e33_cff.py. The Tau paths are intentionally not there in HLT_75e33_cff_timing.py as that's meant to measure a timing that can be compared to the TDR measurement, which does not include the tau paths.
Is there anything that I should do for disabling the path?
removing any reference from the menu (while keeping the configuration fragment) should be enough. EDIT: to be fully explicit, remove it from here: https://github.com/cms-sw/cmssw/blob/a288c8fdd1cbc6f2e0759f8ebbaa9763c76a7e70/HLTrigger/Configuration/python/HLT_75e33_cff.py#L293
Confirm that it's enough to remove it from HLT_75e33_cff.py. The Tau paths are intentionally not there in HLT_75e33_cff_timing.py as that's meant to measure a timing that can be compared to the TDR measurement, which does not include the tau paths.
Once the performance of the Tau paths is better understood, we can hook them into the Timing version to have an improved baseline.
@hsert please update on the status of the fix. Thank you.
I have tried several things, but couldn't figure it out yet. Probably it would be safer to disable the path. Sorry for the inconvenience.
I have tried several things, but couldn't figure it out yet. Probably it would be safer to disable the path. Sorry for the inconvenience.
I have created https://github.com/cms-sw/cmssw/pull/42955, just in case. I am trying (unsuccessfully so far) to reproduce the failure offline (@srimanob it would be good if you could get the specs of the nodes on which the relval jobs failed, to see if there's any architecture dependence). If nothing better appears before the next pre-release deadline, we can go ahead with disabling the path.
Hi @cms-sw/pdmv-l2, could you please help to find the specs of the node that ran these relvals? https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_RVCMSSW_13_3_0_pre3DisplacedSUSY_stopToB_M_800_500mm_13__2026D98noPU_230918_064200_5200
Thx.
I can reproduce the issue with CMSSW_13_3_0_pre3 using slc7_amd64_gcc11 arch. Not tested yet with the default one, el8.
I managed to keep the RAW file which causes the issue in the HLT step. So, to reproduce the issue, you can just rerun HLT on top of the RAW file. Here is my cmsDriver command:
cmsDriver.py step3 -s HLT:@relval2026 --conditions auto:phase2_realistic_T25 --datatier GEN-SIM-DIGI-RAW -n -1 --eventcontent FEVTDEBUGHLT --geometry Extended2026D98 --era Phase2C17I13M9 --python step_3_cfg.py --no_exec --filein file:/eos/cms/store/group/offcomp_upgrade-sw/srimanob/phase2/HLT/Tau/step2.root --fileout file:step3.root --customise SLHCUpgradeSimulations/Configuration/aging.customise_aging_1000
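Not something done in this thread, but as a hypothetical shortcut: once step_3_cfg.py has been generated by the command above, one could restrict the job to a single suspect event to speed up reproduction, e.g. by appending the lines below to the config. The run:lumi:event value is just the one quoted in the original exception and assumes that event is present in the kept RAW file.

# Hypothetical addition at the end of step_3_cfg.py: process only one event,
# here run 1, lumi 13, event 648 as quoted in the relval exception above.
import FWCore.ParameterSet.Config as cms
process.source.eventsToProcess = cms.untracked.VEventRange('1:13:648-1:13:648')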
With CMSSW_13_3_0_pre3, I see more failures in relvals, e.g. with CMSSW_13_3_0_pre3DisplacedSUSY, 3 out of 20 jobs fail with this issue.
I can reproduce the issue with CMSSW_13_3_0_pre3 using slc7_amd64_gcc11 arch.
I can reproduce the issue in these conditions. Interestingly, when I run in CMSSW_13_3_X_2023-10-05-1100 on el8_amd64_gcc11, it does NOT reproduce.
Even more interestingly, at the beginning of the job there is a set of TensorFlow warnings which differs between the two architectures:
On el8_amd64_gcc11:
2023-10-05 17:25:29.050172: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-05 17:25:29.050519: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: UNKNOWN ERROR (34)
2023-10-05 17:25:29.070838: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
On slc7_amd64_gcc11 arch:
2023-10-05 17:29:55.090388: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-05 17:30:19.307018: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-05 17:30:19.308508: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: UNKNOWN ERROR (34)
2023-10-05 17:30:19.352316: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
this is indeed slightly reminiscent of https://github.com/cms-sw/cmssw/issues/40437 and https://github.com/cms-sw/cmssw/issues/40733.
@cms-sw/tau-pog-l2 and @VinInn FYI.
Here is what I get from el8, with CMSSW_13_3_0_pre3:
2023-10-05 17:34:42.609372: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-05 17:34:42.609932: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: UNKNOWN ERROR (34)
2023-10-05 17:34:42.648562: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
and the same issue appears:
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]
Here is what I get from el8, with CMSSW_13_3_0_pre3:
For me, this combination also runs OK on lxplus806. Not sure how to tell the cpu microarchitecture from the lxplus node.
Not sure how to tell the cpu microarchitecture from the lxplus node.
OK, for the record it can be obtained by running cmsRun -e step_3_cfg.py and then inspecting the FrameworkJobReport.xml.
On lxplus806, in which the job succeeds (in CMSSW_13_3_0_pre3 with el8_amd64_gcc11):
<Metric Name="CPUModels" Value="Intel Core Processor (Broadwell, IBRS)"/>
On lxplus761, in which the job fails (in CMSSW_13_3_0_pre3 with slc7_amd64_gcc11):
<Metric Name="CPUModels" Value="Intel Xeon Processor (Skylake, IBRS)"/>