cmssw
Failures at Nano LHEScaleSumw merge failed compatibility
Hi,
Recently, we have been observing failures at the Nano step in many MC production WFs, with the following error:
An exception of category 'LogicError' occurred while [0] Calling InputSource::readRun_ Exception Message: Trying to merge LHEScaleSumw with LHEScaleSumw failed the compatibility test.
The failure is random and the failure rate varies across WFs, sometimes exceeding 10-20%. Example WFs are
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_TOP-Run3Summer22wmLHEGS-00027 https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_HIG-Run3Summer22EEwmLHEGS-00330
A new Issue was created by @sunilUIET sunil bansal.
@antoniovilela, @makortel, @smuzaffar, @sextonkennedy, @Dr15Jones, @rappoccio can you please review it and eventually sign/assign? Thanks.
assign xpog
New categories assigned: xpog
@vlimant,@hqucms you have been requested to review this Pull request/Issue and eventually sign? Thanks
Can someone please pull out two files whose merge leads to this failure?
I will let someone from PnR comment on whether we can get such a list, @z4027163
There was an upgrade in dCache and some FNAL files were lost. FNAL is trying to verify what was damaged, meanwhile, I can speed this up by invalidating the replicas that you find
We have an example currently in the production system: task_TOP-Run3Summer22wmLHEGS-00042 The full list of unmerged files can be found in this error report: https://cms-unified.web.cern.ch/cms-unified/report/cmsunified_task_TOP-Run3Summer22wmLHEGS-00042__v1_T_240129_012657_9610 (after the sentence "1562 Files in no block for TOP-Run3Summer22NanoAODv12-00020_0MergeNANOEDMAODSIMoutput").
Here are some examples:
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/002d042a-632e-4ebb-a018-0d5ef283ce6b.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/005c3d94-b6b9-462f-b9f4-d3551e4d4b71.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/006e8d8a-3c73-4125-bfdf-efde00c7b4ca.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/0112efa2-de05-41ea-b4de-683d6a860caf.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/0151db4c-458d-4f79-bf2c-f66eb5fc9642.root @ T2_US_MIT
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/01996c93-8d74-4972-86c7-69bb6692ec22.root @ T1_US_FNAL_Disk
/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/01bc4538-21cd-4fcf-8bec-2a43be7a0f13.root @ T1_US_FNAL_Disk
@sunilUIET @vlimant FYI
I tested the merge process in 13_0_13 using
python3 $CMSSW_RELEASE_BASE/src/Configuration/DataProcessing/test/RunMerge.py --output-file out.root --mergeNANO --input-file /store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/84cdf48b-eb83-44e5-8127-e10a940c6ae2.root,/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/85668cee-7f91-496a-8cbe-7aa9c3c75fdb.root,/store/unmerged/Run3Summer22NanoAODv12/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/NANOAODSIM/130X_mcRun3_2022_realistic_v5-v4/40000/857aff8a-2164-430f-a311-3ea4237842aa.root
then
cmsRun -j FrameworkJobReport.xml RunMergeCfg.py
and could indeed reproduce the crash.
One file has 35 reweighting weights and the other 34, hence the unmergeability. Is there a way to get the MINI and AOD files that led to the creation of the two files above?
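To illustrate why a size mismatch blocks the merge: run-level sums of weights are combined element by element, so two vectors of different length have no well-defined sum. The sketch below is a toy model of such a check, not the CMSSW implementation; the function name `merge_sumw` and the numbers are made up for illustration.

```python
def merge_sumw(a, b):
    """Element-wise sum of two per-run weight-sum vectors (toy model)."""
    if len(a) != len(b):
        raise ValueError(f"incompatible sizes: {len(a)} vs {len(b)} weights")
    return [x + y for x, y in zip(a, b)]


# Two files with the same number of weights merge fine.
ok = merge_sumw([1.0] * 35, [2.0] * 35)

# A 35-weight file and a 34-weight file cannot be merged.
try:
    merge_sumw([1.0] * 35, [2.0] * 34)
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

print(len(ok), mismatch_error)
```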
The number of weights comes from a regexp over the LHE information: https://github.com/cms-sw/cmssw/blob/CMSSW_13_0_X/PhysicsTools/NanoAOD/plugins/GenWeightsTableProducer.cc#L905C36-L905C52 ; @cms-sw/generators-l2 how can the number of weights vary from one job to another?
@vlimant we discussed this with PdmV too. We will have a look at how this happened; a priori I have no idea, as this was not the case before. Will report back asap.
@menglu21 FYI.
Is it possible that MG fails to compute one of the weight sets and therefore does not include it for one specific job? That would lead to a file, from that job, with only a subset of the weights; such files cannot be merged with others later on because of the size difference.
@bbilin @menglu21 do you have any news on the issue? The number of affected WFs is increasing, so we need to understand this as soon as possible in order to fix it.
Thanks
Hi all, please let me know if you still need those files. We would like to announce this WF if those files are not needed anymore.
@sunilUIET : please provide a list of the samples that exhibit this failure.
@vlimant here is the list (from a few weeks back) provided by PnR. @z4027163 can add to it if he has a more complete list
Hi all, I see two ways forward: a.) If we can get the seeds from the original wmLHE request of the buggy nanos, we can check locally and see why this happens. b.) We extend runcmsgrid.sh by a line that checks the number of weights against our expectation. If the comparison fails, we abort the job. From the failing jobs it should be easy to recover the seed info from the logs. Let me cc other mg5 people @sihyunjeon @Cvico @DickyChant. @srimanob Do you know if a.) is possible? Anyone else who knows?
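A check along the lines of b.) might look roughly like the helper below, run on the text of the LHE header that the gridpack produces. This is only a sketch under assumptions: the function name `count_weights`, the choice to look only at the `mg_reweighting` weightgroup, and the toy header are illustrative and not existing genproductions code.

```python
import re

# Matches an opening or self-closing <weight id="..."> tag; the captured
# group is "/" for a self-closing tag and "" otherwise.
WEIGHT_TAG = re.compile(r'<weight\s+id="[^"]+"\s*(/?)>')


def count_weights(header_text, group="mg_reweighting"):
    """Return (n_total, n_self_closing) for <weight> tags in one weightgroup."""
    pattern = r'<weightgroup[^>]*name="%s"[^>]*>(.*?)</weightgroup>' % re.escape(group)
    match = re.search(pattern, header_text, flags=re.DOTALL)
    if match is None:
        return 0, 0
    closers = WEIGHT_TAG.findall(match.group(1))
    return len(closers), sum(1 for c in closers if c == "/")


toy_header = (
    '<weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">\n'
    '<weight id="a"/>\n'
    '<weight id="b">set param_card dim62f 22 1.0 # orig: 1e-05\n</weight>\n'
    '<weight id="c">set param_card dim62f 15 1.0 # orig: 1e-05\n</weight>\n'
    '</weightgroup>'
)
n_total, n_empty = count_weights(toy_header)
print(n_total, "weights,", n_empty, "without a body")
```

A wrapper script could compare these numbers with the expectation from the reweight card and exit with a non-zero status on a mismatch, so that the job aborts and its seed stays in the logs.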
I think a.) seems more important, because I just opened an error log which seems to suggest that the error happens at the NanoAOD merging step. Do we expect this to be due to some missing weights?
Yes. A weight entry is missing but it is not clear where this is coming from. So ideally we get the seed that is used for runcmsgrid.sh in the wmLHE step for that specific nano so we can locally reproduce.
Exactly
A minor question: do we expect this to be reproducible also at NanoGEN level in case we lose the seed and have to start over?
I would guess so. But I think with a small modification of runcmsgrid.sh as proposed above we can also catch it, if indeed we cannot recover seeds of current workflows.
Hope we don't need either way!
Hi I discussed with @hqucms and checked the corresponding MiniAOD dataset.
So for those files, if we run the standard nanov12 sequence from CMSSW_13_0_13, we can already pick out some files that seem to be good (they give 35 weights, e.g. /store/mc/Run3Summer22MiniAODv4/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/MINIAODSIM/130X_mcRun3_2022_realistic_v5-v4/2820000/0ec39554-e050-4fdf-96dd-b143efb9cdd2.root) and some files that seem to be bad (they give 34 weights, e.g. /store/mc/Run3Summer22MiniAODv4/TWZ_TtoLNu_WtoLNu_Zto2L_DR1_TuneCP5_13p6TeV_amcatnlo-pythia8/MINIAODSIM/130X_mcRun3_2022_realistic_v5-v4/2820000/90f2a55d-a0cc-43b8-be1f-f51594bdc950.root).
Taking these two files as examples, we can retrieve the MiniAOD files and check the LHE run info header:
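# With FWLite
from DataFormats.FWLite import Handle
# test_run: an FWLite run object for the MiniAOD file, opened beforehand
lhehandle = Handle("LHERunInfoProduct")
test_run.getByLabel("externalLHEProducer", lhehandle)
lheruninfo = lhehandle.product()  # here you get the strings that form the `XML` LHE header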
The relevant output (i.e. the part with reweighting weights) is:
- For the good file:
<weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo"/>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_m1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_m1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_m1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_m1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_m1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_1p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_m1p_ctg_0p_nlo">set param_card dim62f 19 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_1p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_m1p_nlo">set param_card dim62f 24 -1.0 # orig: 1e-05
</weight>
</weightgroup>
- For the bad file:
<weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo"/>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_m1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 22 1.0 # orig: 1e-05
</weight>
<weight id="ctz_1p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 22 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_m1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 15 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_1p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 15 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_m1p_cpq3_0p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 13 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_1p_cpq3_0p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 13 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_m1p_ctw_0p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_1p_ctg_0p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 19 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_1p_ctw_0p_ctp_0p_ctg_1p_nlo">set param_card dim62f 12 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_m1p_ctp_0p_ctg_0p_nlo">set param_card dim62f 23 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo"/>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_0p_ctg_1p_nlo">set param_card dim62f 23 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_m1p_ctg_0p_nlo">set param_card dim62f 19 -1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_1p_ctg_1p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05
set param_card dim62f 24 1.0 # orig: 1e-05
</weight>
<weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_0p_ctp_0p_ctg_m1p_nlo">set param_card dim62f 24 -1.0 # orig: 1e-05
</weight>
</weightgroup>
Let me pick out the one line that differs:
- good: <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo">set param_card dim62f 19 1.0 # orig: 1e-05 set param_card dim62f 23 1.0 # orig: 1e-05 </weight>
- bad: <weight id="ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo"/>
Obviously both are valid xml syntax, but unfortunately our GEN weight Nano parser only accepts the good one :)
Relevant code: https://github.com/cms-sw/cmssw/blob/b4572d430a07a0a38f665556c54b7e87379065db/PhysicsTools/NanoAOD/plugins/GenWeightsTableProducer.cc#L595
We need to understand why this happens (@agrohsje and @sihyunjeon please correct me, but I don't think this happens without reweighting launch names...?). I might need to look at the madgraph source code to understand...
What is also obvious: the very first line in both files is also not parsable by our Nano weight parser :) Therefore, if you count the weights there are 36 values, while we have only 35 or 34 entries after running NANO on them.
If we look at the card, at least I think it should have 36 weights.
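The counting above can be reproduced on a toy header. The `PAIRED` pattern below is a simplified stand-in that, like the behaviour described in this thread, only recognises `<weight ...>...</weight>` pairs; it is not the regex from GenWeightsTableProducer.cc, and the weight ids are shortened placeholders.

```python
import re

# Simplified stand-in: matches only <weight id="...">body</weight> pairs.
PAIRED = re.compile(r'<weight\s+id="([^"]+)"\s*>(.*?)</weight>', re.DOTALL)
# Matches any opening or self-closing <weight id="..."> tag.
ANY = re.compile(r'<weight\s+id="([^"]+)"\s*/?>')

good = (
    '<weight id="sm"/>\n'
    '<weight id="ctw_1p">set param_card dim62f 23 1.0\n</weight>\n'
    '<weight id="ctw_1p_ctp_1p">set param_card dim62f 19 1.0\n'
    'set param_card dim62f 23 1.0\n</weight>\n'
)
bad = (
    '<weight id="sm"/>\n'
    '<weight id="ctw_1p">set param_card dim62f 23 1.0\n</weight>\n'
    '<weight id="ctw_1p_ctp_1p"/>\n'
)

counts = {}
for name, header in (("good", good), ("bad", bad)):
    declared = ANY.findall(header)                        # all weight tags
    parsed = [wid for wid, _ in PAIRED.findall(header)]   # paired tags only
    counts[name] = (len(declared), len(parsed))
    print(name, "declared:", len(declared), "parsed:", len(parsed))
```

Both toy headers declare three weights, but the paired-only pattern sees two in the good one and one in the bad one, mirroring the 36 declared versus 35 or 34 parsed entries.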
I cannot reproduce the MG5 header with the same random seed and nevents as the bad file....
hey interesting finding. just based on quick scanning, maybe this part https://github.com/mg5amcnlo/mg5amcnlo/blob/LTS/models/check_param_card.py#L572-L575 is not working as expected?
For the first lines that we are dropping, at least in madgraph internal it works as it is encoded - as the parameters are the same as the original values set in customization card it prints out nothing. But what is weird is the difference between good and bad file ...
On a separate note, though, that tWZ sample should've not been submitted in the first place since madspin+reweight was already found to be wonky IIRC (will open an issue on this in https://github.com/cms-sw/genproductions)
Actually the issue itself is more tricky than this as the relevant bits are:
https://github.com/mg5amcnlo/mg5amcnlo/blob/59b4b9c1238978f39a32b8bc83244328187704b6/madgraph/interface/reweight_interface.py#L870C1-L883C80
There you can clearly see that it is supposed to always produce <weight> </weight> syntax. (v265 has slightly different code, but what is done there is similar; one can easily check this by untarring the gridpack and looking at this file in the mg5basesdir.)
I think the other VHH sample is also affected, which doesn't have anything to do with the madspin+reweighting issue.
To me, the quicker (and uglier) solution is to fix the regex pattern we've been using (I don't know if this is really a fix, because from the madgraph source code one would never expect another possible output syntax).
The better long-term solution is to leverage an existing xml parser without reinventing the wheel (like what we did for LHEInterface and Kenneth's PR on refactoring genweighttable, if I remember correctly?).
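As a sketch of the parser-based route, Python's standard `xml.etree.ElementTree` treats `<weight id="..."/>` and `<weight id="...">...</weight>` as the same kind of element, so both forms are seen. This only illustrates the idea on a trimmed weightgroup block; the actual NanoAOD producer is C++, the helper name `read_weights` is hypothetical, and a full LHE header may not be valid XML as a whole, so the block is assumed to have been extracted first.

```python
import xml.etree.ElementTree as ET


def read_weights(weightgroup_xml):
    """Return [(id, body)] for every <weight> element, empty or not."""
    root = ET.fromstring(weightgroup_xml)
    return [(w.get("id"), (w.text or "").strip()) for w in root.iter("weight")]


header = """<weightgroup name="mg_reweighting" weight_name_strategy="includeIdInWeightName">
<weight id="ctw_0p_ctp_0p"/>
<weight id="ctw_1p_ctp_0p">set param_card dim62f 23 1.0 # orig: 1e-05
</weight>
<weight id="ctw_1p_ctp_1p"/>
</weightgroup>"""

weights = read_weights(header)
empty_ids = [wid for wid, body in weights if not body]
print(len(weights), "weights; empty:", empty_ids)
```

With this approach the number of weights no longer depends on how the element was serialised, and entries without a body can still be flagged separately.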
For which you clearly see that it is supposed to be always producing <weight> </weight> syntax.
So somewhere this /weight> is getting dropped and making /> which i don't understand...
I think the other VHH sample is also affected, which doesn't have anything to do with the madspin+reweighting issue.
Yes that's why i said it's a "separate note"
A lot of useful and confusing info in that thread. Let me catch up: 1.) You connect mini and nano: Did you find the name of the mini input files in the logs of the corrupted nano? Do you have a link? 2.) How did you recover the seed of the wmLHE step? 3.) Do we still have the logs of the wmLHE step? We can fix the regex but I am really worried that the same code executed on different machines produces different output.
(1): I chatted with @hqucms and we both just thought about running on the published MiniAODs (the published MiniAOD dataset has ~1M events while the corresponding Nano has just 10k, so we believed there were buggy files, and luckily there are some). I ran condor jobs with the standard Nano sequence, checked the merge compatibility of the resulting Nano files, and picked out the MiniAODs that give good and bad Nano output lol
(2): The seed and number of events I got are from the header, since a madgraph run stores the run_card in the header of the LHE files.
(3): Unfortunately no, and it seems I cannot reproduce anything... But I might be missing something... I do have the feeling that I saw a similar error again, but once I modified the mgbasedir code to verify my hypothesis on the functional part, the error disappeared...
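For point (2), the run_card stored in the LHE header uses `value = name ! comment` lines, so the seed and event count can be picked out with a few lines of Python. The sketch assumes the run_card text has already been extracted from the header; the sample values and the helper name `parse_run_card` are made up for illustration.

```python
def parse_run_card(text):
    """Parse 'value = name ! comment' lines into a {name: value} dict."""
    settings = {}
    for line in text.splitlines():
        line = line.split("!", 1)[0].strip()  # drop the trailing comment
        if not line or line.startswith("#") or "=" not in line:
            continue
        value, name = (part.strip() for part in line.split("=", 1))
        settings[name] = value
    return settings


sample = """
#*********************************************************************
  500 = nevents ! Number of unweighted events requested
  12345 = iseed ! rnd seed (0=assigned automatically=default)
"""

card = parse_run_card(sample)
print(card["nevents"], card["iseed"])
```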
hmmm @DickyChant were you able to find other buggy cases? I am wondering if the bug always affects the same weight block ctz_0p_cpt_0p_cpqm_0p_cpq3_0p_ctw_1p_ctp_1p_ctg_0p_nlo in this tWZ sample