
Failures at Nano LHEScaleSumw merge failed compatibility

Open sunilUIET opened this issue 1 year ago • 48 comments

Hi,

Recently, we have been observing failures at the Nano step in many MC production WFs, such as:


An exception of category 'LogicError' occurred while [0] Calling InputSource::readRun_ Exception Message: Trying to merge LHEScaleSumw with LHEScaleSumw failed the compatibility test.


The failure is random and the failure rate varies across WFs, sometimes more than 10-20%. Example WFs are:

https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_TOP-Run3Summer22wmLHEGS-00027
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-Run3Summer22EEwmLHEGS-00330

sunilUIET avatar Jan 25 '24 07:01 sunilUIET

cms-bot internal usage

cmsbuild avatar Jan 25 '24 07:01 cmsbuild

A new Issue was created by @sunilUIET sunil bansal.

@antoniovilela, @makortel, @smuzaffar, @sextonkennedy, @Dr15Jones, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild avatar Jan 25 '24 07:01 cmsbuild

assign xpog

makortel avatar Jan 25 '24 14:01 makortel

New categories assigned: xpog

@vlimant,@hqucms you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild avatar Jan 25 '24 14:01 cmsbuild

can one please pull out two files to merge that lead to this failure?

vlimant avatar Jan 26 '24 08:01 vlimant

I will let someone from PnR comment on whether we can get such a list, @z4027163.

sunilUIET avatar Jan 26 '24 08:01 sunilUIET

There was an upgrade in dCache and some FNAL files were lost. FNAL is trying to verify what was damaged, meanwhile, I can speed this up by invalidating the replicas that you find

amanrique1 avatar Jan 26 '24 15:01 amanrique1

We have an example currently in the production system: task_TOP-Run3Summer22wmLHEGS-00042. The full list of unmerged files can be found in this error report: https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610 (after the sentence "1562 Files in no block for TOP-Run3Summer22NanoAODv12-00020_0MergeNANOEDMAODSIMoutput").

Here are some examples:

/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/002d042a-632e-4ebb-a018-0d5ef283ce6b.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/005c3d94-b6b9-462f-b9f4-d3551e4d4b71.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/006e8d8a-3c73-4125-bfdf-efde00c7b4ca.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/0112efa2-de05-41ea-b4de-683d6a860caf.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/0151db4c-458d-4f79-bf2c-f66eb5fc9642.root @ T2_US_MIT
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/01996c93-8d74-4972-86c7-69bb6692ec22.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/01bc4538-21cd-4fcf-8bec-2a43be7a0f13.root @ T1_US_FNAL_Disk

@sunilUIET @vlimant FYI

z4027163 avatar Jan 30 '24 03:01 z4027163

I tested the merge process in 13_0_13 using

python3 $CMSSW_RELEASE_BASE/src/Configuration/DataProcessing/test/RunMerge.py --output-file out.root --mergeNANO --input-file /store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/84cdf48b-eb83-44e5-8127-e10a940c6ae2.root,/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/85668cee-7f91-496a-8cbe-7aa9c3c75fdb.root,/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/857aff8a-2164-430f-a311-3ea4237842aa.root

then

cmsRun -j FrameworkJobReport.xml RunMergeCfg.py

and could indeed reproduce the crash.

vlimant avatar Jan 30 '24 08:01 vlimant

one file has 35 reweighting weights and the other 34 ; hence the unmergability. Is there a way to get the MINI and AOD files that led to the creation of the two files above ?
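To illustrate why a 35-entry file and a 34-entry file cannot be merged, here is a minimal Python sketch of an element-wise sum with a length check. It is not the NanoAOD merge code: `merge_sumw` is a hypothetical name, the values are made up, and the real compatibility test may check more than the length.

```python
def merge_sumw(a, b):
    """Element-wise sum of two per-run weight-sum vectors.

    Hypothetical helper: refuses to merge vectors of different length,
    which is the situation described above (35 vs 34 entries).
    """
    if len(a) != len(b):
        raise ValueError(f"cannot merge: {len(a)} vs {len(b)} entries")
    return [x + y for x, y in zip(a, b)]


file_a = [1.0] * 35  # made-up sums from a file with 35 reweighting weights
file_b = [1.0] * 34  # made-up sums from a file with 34 reweighting weights

try:
    merge_sumw(file_a, file_b)
except ValueError as err:
    print(err)
```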

the size of the weights comes from a regexp over the LHE information : https://github.com/cms-sw/cmssw/blob/CMSSW_13_0_X/PhysicsTools/NanoAOD/plugins/GenWeightsTableProducer.cc#L905C36-L905C52 ; @cms-sw/generators-l2 how can the number of weights vary from one job to the other ?

vlimant avatar Jan 30 '24 09:01 vlimant

@vlimant we discussed this with PdmV too. We will now have a look at how this happened; a priori I have no idea, as this was not the case before. Will report back asap.

@menglu21 FYI.

bbilin avatar Jan 31 '24 13:01 bbilin

Is it possible that MG fails to compute one of the weight sets and therefore does not include it for one specific job? That would lead to a file, from that job, with only a subset of the weights; such files cannot be merged with others later on because of the size difference.

vlimant avatar Jan 31 '24 13:01 vlimant

@bbilin @menglu21 do you have any news on the issue? The number of affected WFs is increasing, so we need to understand this as soon as possible to get a fix.

Thanks

sunilUIET avatar Feb 13 '24 10:02 sunilUIET

Hi all, please let me know if you still need those files. We would like to announce this WF if those files are not needed anymore.

z4027163 avatar Feb 16 '24 23:02 z4027163

@sunilUIET : please provide a list of the samples that exhibit this failure.

vlimant avatar Mar 04 '24 09:03 vlimant

@vlimant here is the list (from a few weeks back) provided by PnR. @z4027163 can add to it if he has a more complete list.

sunilUIET avatar Mar 04 '24 10:03 sunilUIET

Hi all, I see two ways forward: a.) If we can have the info of seeds from the original wmLHE request of the buggy nano's, we can locally check and see why this happens. b.) We extend runcmsgrid.sh by a line that checks the number of weights against our expectation. if the comparison fails we abort the job. From the failing jobs it should be easy to recover the seed info from the logs. Let me cc other mg5 people @sihyunjeon @cvico @dickychant. @srimanob Do you know if a.) is possible? Anyone else who knows?
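As a rough illustration of option b.), the check could be a small Python helper that counts the `<weight>` entries in the LHE header text and compares against an expected number. This is only a sketch under assumptions: the function names and the expected count are hypothetical, the header is assumed to be available as a string, and how `runcmsgrid.sh` would call the helper and abort is left out.

```python
import re

# Matches both <weight id="..."/> and <weight id="...">body</weight>.
WEIGHT = re.compile(r'<weight\s+id="([^"]+)"\s*(?:/>|>(.*?)</weight>)', re.DOTALL)


def count_weights(header_text):
    """Return (number of weight ids, number of those with a non-empty body)."""
    total = 0
    with_body = 0
    for match in WEIGHT.finditer(header_text):
        total += 1
        body = match.group(1 + 1)  # None for the self-closing form
        if body is not None and body.strip():
            with_body += 1
    return total, with_body


def check_weights(header_text, expected_with_body):
    """True if the header carries the expected number of non-empty weights."""
    _, with_body = count_weights(header_text)
    return with_body == expected_with_body
```

A job wrapper could call `check_weights` on the produced LHE header and exit with a non-zero status when it returns `False`, so that the seed of the failing job can be read from its log.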

agrohsje avatar Mar 04 '24 11:03 agrohsje

Hi all, I see two ways forward: a.) If we can have the info of seeds from the original wmLHE request of the buggy nano's, we can locally check and see why this happens. b.) We extend runcmsgrid.sh by a line that checks the number of weights against our expectation. if the comparison fails we abort the job. From the failing jobs it should be easy to recover the seed info from the logs. Let me cc other mg5 people @sihyunjeon @Cvico @DickyChant. @srimanob Do you know if a.) is possible? Anyone else who knows?

I think a.) seems more important, because I just opened an error log [1] which seems to suggest that the error happens at the NanoAOD merging step. Do we expect this to be due to some missing weights?

DickyChant avatar Mar 04 '24 11:03 DickyChant

Yes. A weight entry is missing but it is not clear where this is coming from. So ideally we get the seed that is used for runcmsgrid.sh in the wmLHE step for that specific nano so we can locally reproduce.

agrohsje avatar Mar 04 '24 11:03 agrohsje

Yes. A weight entry is missing but it is not clear where this is coming from. So ideally we get the seed that is used for runcmsgrid.sh in the wmLHE step for that specific nano so we can locally reproduce.

Exactly

A minor question: do we expect this to be reproducible also at NanoGEN level in case we lose the seed and have to start over?

DickyChant avatar Mar 04 '24 11:03 DickyChant

I would guess so. But I think with a small modification of runcmsgrid.sh as proposed above we can also catch it, if indeed we cannot recover seeds of current workflows.

agrohsje avatar Mar 04 '24 11:03 agrohsje

I would guess so. But I think with a small modification of runcmsgrid.sh as proposed above we can also catch it, if indeed we cannot recover seeds of current workflows.

Hope we don't need either way!

DickyChant avatar Mar 04 '24 11:03 DickyChant

Hi I discussed with @hqucms and checked the corresponding MiniAOD dataset.

So for those files, if we run standard nanov12 sequence from CMSSW_13_0_13, we could already pick up some files that seem to be good (give 35 weights, e.g. /store/mc/Run3Summer22MiniAODv4/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/MINIAODSIM/130X_mcRun3_2022_realistic_v5-v4/2820000/0ec39554-e050-4fdf-96dd-b143efb9cdd2.root) and some files that seem to be bad (give 34 weights, e.g. /store/mc/Run3Summer22MiniAODv4/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/MINIAODSIM/130X_mcRun3_2022_realistic_v5-v4/2820000/90f2a55d-a0cc-43b8-be1f-f51594bdc950.root).

Taking these two files as examples, we retrieve the MiniAOD files and check the LHE run info header:

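# With FWLite
from DataFormats.FWLite import Handle, Runs
lhehandle = Handle("LHERunInfoProduct")
for test_run in Runs("miniaod.root"):  # placeholder: path to the MiniAOD file
    test_run.getByLabel("externalLHEProducer", lhehandle)
    lheruninfo = lhehandle.product()  # holds the lines that form the `XML` LHE header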

The relevant output (i.e. the part with the reweighting weights) is:

  1. For good file:
<weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo"/>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_m1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_m1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_m1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_m1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_m1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_1p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_m1p_ctg_0p_nlo">set param_card dim62f 19 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_1p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_m1p_nlo">set param_card dim62f 24 -1.0 # orig: 1e-05
</weight>
</weightgroup>
  2. For bad file:
<weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo"/>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_m1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_m1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_m1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_m1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_m1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo"/>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_1p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_m1p_ctg_0p_nlo">set param_card dim62f 19 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_1p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_m1p_nlo">set param_card dim62f 24 -1.0 # orig: 1e-05
</weight>
</weightgroup>

Let me pick out the one entry that differs:

  1. good: <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05 set param_card dim62f 23 1.0 # orig: 1e-05 </weight>
  2. bad: <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo"/>

Both are obviously valid XML; unfortunately, our GEN weight parser in Nano only accepts the form in the good file :)

Relevant code: https://github.com/cms-sw/cmssw/blob/b4572d430a07a0a38f665556c54b7e87379065db/PhysicsTools/NanoAOD/plugins/GenWeightsTableProducer.cc#L595
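To make the effect concrete, here is a small Python sketch with two illustrative patterns. Neither is the expression used in `GenWeightsTableProducer.cc`; `strict` and `tolerant` are hypothetical, and the weight ids are shortened stand-ins for the real ones. A pattern that requires the `<weight id="...">` opening tag silently skips the self-closing entries, while one that also allows `/>` finds every id.

```python
import re

# Shortened stand-in for the mg_reweighting block shown above.
header = '''<weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">
<weight id="w0"/>
<weight id="w1">set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="w2"/>
<weight id="w3">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
</weightgroup>'''

# Hypothetical strict pattern: the tag must end with '>' right after the id.
strict = re.compile(r'<weight\s+id="([^"]+)">')
# Hypothetical tolerant pattern: also accepts the self-closing '/>' form.
tolerant = re.compile(r'<weight\s+id="([^"]+)"\s*/?>')

print(strict.findall(header))    # ['w1', 'w3']
print(tolerant.findall(header))  # ['w0', 'w1', 'w2', 'w3']
```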

We need to understand why this happens (@agrohsje and @sihyunjeon please correct me but I don't think this happens without reweighting launch names...?) I might need to look at madgraph source code to understand...

What is also obvious: the very first entry in both files cannot be parsed by our Nano weight parser either :) So if you count the weights there are 36 entries in the header, while we end up with either 35 or 34 entries after running Nano on them.

If we look at the card, at least I think it should have 36 weights.

DickyChant avatar Mar 06 '24 01:03 DickyChant

Cannot reproduce the MG5 header with the same random seed and nevents from the bad file....

DickyChant avatar Mar 06 '24 02:03 DickyChant

hey interesting finding. just based on quick scanning, maybe this part https://github.com/mg5amcnlo/mg5amcnlo/blob/LTS/models/check_param_card.py#L572-L575 is not working as expected?

For the first lines that we are dropping, at least in madgraph internal it works as it is encoded - as the parameters are the same as the original values set in customization card it prints out nothing. But what is weird is the difference between good and bad file ...

On a separate note, though, that tWZ sample should not have been submitted in the first place, since madspin+reweight was already found to be wonky IIRC (will open an issue on this in https://github.com/cms-sw/genproductions).

sihyunjeon avatar Mar 06 '24 07:03 sihyunjeon

python3 $CMSSW_RELEASE_BASE/src/Configuration/DataProcessing/test/RunMerge.py --output-file out.root --mergeNANO --input-file

Actually the issue itself is more tricky than this as the relevant bits are:

https://github.com/mg5amcnlo/mg5amcnlo/blob/59b4b9c1238978f39a32b8bc83244328187704b6/madgraph/interface/reweight_interface.py#L870C1-L883C80

There you can clearly see that it is supposed to always produce the <weight> </weight> syntax. (v265 has slightly different code, but what it does is similar; one can easily check this by untarring the gridpack and looking at this file in the mg5basesdir.)

I think the other VHH sample is also affected, and it doesn't have anything to do with the madspin+reweighting issue.

To me, the quicker (and uglier) solution is to fix the regex pattern we've been using (I don't know if this is really a fix, because from the MadGraph source code one would never expect another possible output syntax).

The better solution that works for long term is to leverage existing xml parser without reinventing the wheel. (like what we did for LHEInterface and Kenneth's PR on refactoring genweighttable if I don't remember things wrongly?)
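A minimal sketch of the XML-parser idea, in Python with the standard library for brevity (the production code is C++, so this only illustrates the approach). It assumes the `<weightgroup>` block has already been extracted from the header and is well-formed XML on its own; `parse_weightgroup` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def parse_weightgroup(weightgroup_xml):
    """Return [(id, body)] for every <weight> element, in document order.

    Self-closing elements such as <weight id="x"/> are kept, with an
    empty body, instead of being dropped.
    """
    root = ET.fromstring(weightgroup_xml)
    return [(w.get("id"), (w.text or "").strip()) for w in root.iter("weight")]


example = (
    '<weightgroup name="mg_reweighting">'
    '<weight id="a"/>'
    '<weight id="b">set param_card dim62f 22 1.0 # orig: 1e-05\n</weight>'
    '</weightgroup>'
)
for weight_id, body in parse_weightgroup(example):
    print(weight_id, repr(body))
```

With a real parser both spellings of an element yield an entry, so the number of weights no longer depends on which form MadGraph happened to write.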

DickyChant avatar Mar 06 '24 08:03 DickyChant

There you can clearly see that it is supposed to always produce the <weight> </weight> syntax.

So somewhere this </weight> is getting dropped, leaving />, which I don't understand...

I think the other VHH sample is also affected, and it doesn't have anything to do with the madspin+reweighting issue.

Yes that's why i said it's a "separate note"

sihyunjeon avatar Mar 06 '24 08:03 sihyunjeon

A lot of useful and confusing info in that thread. Let me catch up: 1.) You connect mini and nano: Did you find the name of the mini input files in the logs of the corrupted nano? Do you have a link? 2.) How did you recover the seed of the wmLHE step? 3.) Do we still have the logs of the wmLHE step? We can fix the regex but I am really worried that the same code executed on different machines produces different output.

agrohsje avatar Mar 06 '24 09:03 agrohsje

A lot of useful and confusing info in that thread. Let me catch up: 1.) You connect mini and nano: Did you find the name of the mini input files in the logs of the corrupted nano? Do you have a link? 2.) How did you recover the seed of the wmLHE step? 3.) Do we still have the logs of the wmLHE step? We can fix the regex but I am really worried that the same code executed on different machines produces different output.

(1): I chatted with @hqucms and we both thought about running on the published MiniAODs. (The published MiniAOD dataset has ~1M events while the corresponding Nano has just 10k, so we believed there were buggy files, and luckily there are some.) I ran condor jobs with the standard Nano sequence, checked the merge compatibility of the resulting Nano files, and picked out the MiniAODs that give good and bad Nano output lol

(2): The seed and number of events I got are from the header, since a MadGraph run stores the run_card in the header of the LHE files.

(3): Unfortunately no, and it seems I cannot reproduce anything... But I might be overlooking something... I do have the feeling that I saw a similar error again, but once I modified the mgbasedir code to verify my hypothesis on the functional part, the error disappeared...
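For reference, point (2) could be scripted roughly as below. This Python sketch assumes the run_card lines in the header follow the usual MadGraph layout `value = name ! comment`; the function name is hypothetical and the sample values are made up, not taken from the files in this thread.

```python
import re

# One run_card setting per line: "<value> = <name> ! optional comment".
RUN_CARD_LINE = re.compile(r'^\s*(\S+)\s*=\s*(\w+)\s*(?:!.*)?$')


def run_card_settings(header_text, names=("iseed", "nevents")):
    """Extract selected run_card settings from LHE header text as strings."""
    found = {}
    for line in header_text.splitlines():
        match = RUN_CARD_LINE.match(line)
        if match and match.group(2) in names:
            found[match.group(2)] = match.group(1)
    return found


# Made-up header excerpt for illustration only.
sample = """
  10000 = nevents ! Number of unweighted events requested
  12345 = iseed   ! rnd seed (0=assigned automatically=default))
"""
print(run_card_settings(sample))
```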

DickyChant avatar Mar 06 '24 09:03 DickyChant

hmmm @DickyChant were you able to find other buggy cases? i am wondering if the bug always affects the same weight block ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo in this twz sample

sihyunjeon avatar Mar 06 '24 09:03 sihyunjeon