2012 datasets - initial considerations
Notes for CMS 2012 data:
Collision data:
dataset=/*/*2012*22Jan2013*/AOD (total of 1.1 PB out of which 510 TB Parked - see https://cds.cern.ch/record/1480607/files/DP2012_022.pdf and also https://profmattstrassler.com/articles-and-posts/lhcposts/triggering-advances-in-2012/data-parking-at-cms/ :smiley: )
A: from the start to May 6: 45 TB (including 2 TB Parked)
B: from May 12 to June 18: 230 TB (95 TB Parked)
C: from July 1 to September 27: 354 TB (155 TB Parked)
D: from September 28 to December 5: 477 TB (260 TB Parked)
NB: dates are approximate, and there is some double-counting in the data volumes because of the special HZZ, TOPElePlusJets and TOPMuPlusJets processings (42.9 TB)

TBD:
- [x] Check if it is justified not to include Parked datasets
- [x] Decide whether to release A+B+C (approx. 14 fb-1), B+C (approx. 13 fb-1) or C+D (approx. 16 fb-1): all these combinations are less than 0.5 PB (without Parked), see the sketch below
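As a quick cross-check of the statement above, a minimal sketch of the volume arithmetic, using the approximate per-run-period numbers quoted earlier in this thread (Parked portions subtracted); illustrative only, not an authoritative bookkeeping source.

```python
# Non-Parked data volume for the candidate release combinations,
# using the approximate per-run-period figures quoted above (in TB).
total = {"A": 45, "B": 230, "C": 354, "D": 477}   # total AOD volume per run period
parked = {"A": 2, "B": 95, "C": 155, "D": 260}    # Parked part of that total

non_parked = {run: total[run] - parked[run] for run in total}

for combo in (("A", "B", "C"), ("B", "C"), ("C", "D")):
    volume = sum(non_parked[run] for run in combo)
    print("+".join(combo), "->", volume, "TB without Parked")
# Each combination stays below 0.5 PB (500 TB) once the Parked part is excluded.
```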
MC:
dataset=/*/*Summer12_DR53X*V7*/AODSIM (total of 2 PB) or
dataset=/*/*Summer12_DR53X*V19*/AODSIM (total of 1 PB)
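As a side note, the two patterns above can be listed directly from DAS to see what each one matches; a minimal sketch, assuming dasgoclient is available (e.g. in a CMSSW environment with a valid grid proxy). The authoritative answer on the overlap came from the experts, see below.

```python
# List the datasets matching the two Summer12_DR53X patterns above via DAS.
# Assumes dasgoclient is available and a valid grid proxy is in place.
import subprocess

def das_datasets(pattern):
    """Return the set of dataset names matching a DAS wildcard pattern."""
    out = subprocess.run(
        ["dasgoclient", "--query", "dataset=" + pattern],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())

v7 = das_datasets("/*/*Summer12_DR53X*V7*/AODSIM")
v19 = das_datasets("/*/*Summer12_DR53X*V19*/AODSIM")
print(len(v7), "V7 datasets,", len(v19), "V19 datasets,",
      len(v7 & v19), "in common")
```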
TBD:
- [x] check with the experts if these are overlapping:
No, the legacy campaign for the 2012 MC (called Summer12DR53X) is the one using the GT: START53_V19 (757.6 TB)
In that campaign there were also some additional special productions (mostly run-dependent) using other GTs:
- START53_V19F: BPH run-dependent DIGI-RECO in 5.3 (162.8 TB)
- START53_V19E: DIGI-RECO in 5.3 with specific muon alignment for EXO (1.3 TB)
- START53_V7N: run-dependent production for h2gg (254.6 TB)
Open question for V19D (38 TB)
(From Gianluca Germinara and PPD)
- [x] check if there is any obvious rule to decide what to leave out
- [x] check with the 2011 "algorithm" how to divide these in categories
- [x] check with @tiborsimko that all 2011 procedures scale with the increased number of datasets
Include Parked data in the release
@katilp: I'm probably missing something, but I thought I should confirm anyway that you mean 0.5 TB for the second TBD and not 0.5 PB.
@RaoOfPhysics: good point, thanks; I mean 0.5 PB
Recapitulating numbers for the resource request:
Data: 2012 data taking was divided into four runs: RunA, RunB, RunC and RunD, with a total of 1.1 PB, out of which a part will be released in 2017. The released data will be at maximum 831 TB (in case of releasing RunC + RunD) and at minimum 477 TB (if only RunD is released); other combinations are possible and will be defined considering the best possible software compatibility with the MC samples. In the long term, CMS may consider releasing the full data, if so decided by the collaboration board, but this will not happen in 2017.
MC: The legacy campaign for the 2012 MC (called Summer12DR53X) is the one using the GT START53_V19 and sums up to 757.6 TB. In addition, some special productions should be kept (run-dependent production for B physics with 162.8 TB, special alignment for muons with 1.3 TB and Higgs to gg with 254.6 TB), summing up to a total of 1.2 PB.
CMS therefore needs for the long-term preservation and open access of these samples 2 PB of disk space in eospublic at CERN, to be served through xrootd (or direct download) from opendata.cern.ch.
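As a quick cross-check of the 2 PB figure, a minimal sketch adding up the numbers quoted above (taking the maximum RunC + RunD option for the data; all figures are the ones from this thread):

```python
# Cross-check of the ~2 PB disk request from the figures quoted above (in TB).
data_max = 354 + 477                       # RunC + RunD collision data (maximum option)
mc = 757.6 + 162.8 + 1.3 + 254.6           # Summer12_DR53X legacy + special productions

total_tb = data_max + mc
print(f"data (max) = {data_max} TB, MC = {mc:.1f} TB, total = {total_tb:.1f} TB "
      f"~ {total_tb / 1000:.1f} PB")
# -> roughly 2.0 PB, matching the eospublic request.
```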
Storage space for 2 PB OK for 2017, so we can proceed 😃
Listing of https://cmsweb.cern.ch/das/request?view=plain&limit=3000&instance=prod%2Fglobal&input=dataset%3D%2F*%2FSummer12_DR53XV19*%2FAODSIM
For the Run periods to be released, a good choice could be RunB + RunC. It would amount to 584 TB and would include most of the data that went into the Higgs discovery analysis. To be discussed with the CMS physics coordination.
Update: From the Higgs -> 4l point of view, there is no objection to RunB + RunC. RunD has some more pile-up, but would still be good as well. (from Andre Mendes)
https://cmsweb.cern.ch/das/request?view=list&instance=prod%2Fglobal&input=dataset%3D%2F*%2F2012B22Jan2013*%2FAOD+ (31 datasets, Run2012B-22Jan2013_listing.pdf)
https://cmsweb.cern.ch/das/request?view=list&instance=prod%2Fglobal&input=dataset%3D%2F*%2F2012C22Jan2013*%2FAOD+ (39 datasets, Run2012C-22Jan2013_listing.pdf)
NB: the list of triggers in each dataset is available at https://fwyzard.web.cern.ch/fwyzard/hlt/2012/dataset
Confirmed that they are all proper datasets for physics; see also http://inspirehep.net/record/1467921/files/10.1016_j.nuclphysbps.2015.09.144.pdf (Dataset definition for CMS operations and physics analyses).
In particular from RunB and RunC:
- HTMHTParked (Parked dataset motivated by Susy hadronic searches)
- HcalNZS (technical trigger with HLT_HcalNZS, HLT_HcalPhiSym, HLT_HcalUTCA)
- NoBPTX (technical trigger)
- VBF1Parked (motivated by Vector Boson Fusion)
In addition from RunC:
- LP_ZeroBias
- LP_ExclEGMU
- LP_Jets1
- LP_Jets2
- LP_MinBias1
- LP_MinBias2
- LP_MinBias3
- LP_RomanPots
-> all LP_ datasets contain only the 5 runs 198899-198903; https://twiki.cern.ch/twiki/bin/view/CMS/CertificationCollisions12 indicates that they are Totem runs: "LHC fill: 2836 Runs: 198899 198900 198901 198902 198903 Comment: Totem run - Special trigger menu - Run 198898 is flagged as cosmic but it is a collision run"
Update Nov 4 2016: The LP_ datasets are the CMS part of the common CMS-TOTEM runs. The data can be analysed only when combined with separate TOTEM data. According to the FSQ PAG conveners, we can leave them out of the release.
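For completeness, a minimal sketch of how the run content of the LP_ datasets could be checked directly in DAS; it assumes dasgoclient with a valid grid proxy, and that the LP_ primary datasets follow the /PrimaryDataset/*22Jan2013*/AOD naming seen in the RunC listing above (not confirmed here).

```python
# List the runs contained in each LP_ dataset via DAS.
# Assumes dasgoclient and a valid grid proxy; the wildcard pattern below follows
# the Run2012C 22Jan2013 naming used elsewhere in this thread and is an assumption.
import subprocess

def das_query(query):
    out = subprocess.run(["dasgoclient", "--query", query],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()

primary = ["LP_ZeroBias", "LP_ExclEGMU", "LP_Jets1", "LP_Jets2",
           "LP_MinBias1", "LP_MinBias2", "LP_MinBias3", "LP_RomanPots"]

for pd in primary:
    # resolve the full dataset name(s), then ask DAS for the runs they contain
    for dataset in das_query(f"dataset=/{pd}/*22Jan2013*/AOD"):
        runs = sorted(das_query(f"run dataset={dataset}"))
        print(dataset, "->", runs)   # expected: 198899-198903 only
```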
The release date will be decided together with the CMS physics coordination making sure that all CMS 8 TeV key analyses will have been published before.
NB: no DoubleMu in the 22Jan2013 reprocessing; checking this with the Higgs POG, Muon PAG and PPD. An earlier reprocessing is available: RunB: https://cmsweb.cern.ch/das/request?view=list&limit=50&instance=prod%2Fglobal&input=dataset%3D%2FDoubleMu%2F2012B%2FAOD RunC: https://cmsweb.cern.ch/das/request?view=list&limit=50&instance=prod%2Fglobal&input=dataset%3D%2FDoubleMu%2F2012C%2FAOD
Explanation: No need for DoubleMu, as DoubleMuParked contains DoubleMu.