
CMS: better documentation of CSV columns

Open · tiborsimko opened this issue on Dec 03 '14 · 19 comments

The CMS CSV files typically contain a column header line like the following:

Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

It would be useful (both to users and for long-term preservation purposes) to store the expanded semantics behind these columns.

P.S. See also the forthcoming ATLAS Higgs challenge CSV files, where the columns are described in accompanying PDF documentation. It would be useful to store their meaning in the metadata and/or next to the files as well.
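For illustration, here is a minimal sketch (assuming pandas and a local copy of dimuon100k.csv, one of the files listed further down in this thread) of what a user currently gets: bare column names with no units or definitions.

```python
# Minimal sketch, assuming pandas is installed and dimuon100k.csv (listed
# later in this thread) has been downloaded next to this script.
import pandas as pd

df = pd.read_csv("dimuon100k.csv")
print(df.columns.tolist())
# Prints something like ['Type', 'Run', 'Event', 'E1', ...]: the names alone
# do not tell a new user the units or the exact meaning of each column.
```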

tiborsimko · Dec 03 '14 09:12

Would it be on @tpmccauley to provide that?

pherterich · Jan 26 '15 16:01

It would. What would be the best medium?

tpmccauley · Jan 26 '15 17:01

I don't know. Upload a separate txt file? I could hack it into the metadata as we did for ATLAS. Probably depends on how extensive you want to make it. You're the expert :).

pherterich · Jan 27 '15 08:01

@tpmccauley @pherterich Since the column descriptions can change as file versions change, it makes sense to store them in a versioned way for better preservability. For ATLAS, we have stored them simply in the record itself, in the MARC tag 505:

  • https://github.com/cernopendata/opendata.cern.ch/blob/master/invenio_opendata/testsuite/data/atlas/atlas-higgs-challenge-2014.xml#L36-171

Provided the column descriptions are reasonably short, this seems acceptable.

If the descriptions tend to get very long, or if a record contains more than one kind of CSV file, then we'd better store them apart, in a sort of additional "meaning" file next to the data file. E.g. for the file foo.csv we could introduce foo-csv-column-description.json or somesuch that would describe the columns in a machine-processable way. @pherterich @suenjedt is there some recommendation in the DP world on how "raw data files" and their "meaning files" are coupled together, without going too deeply into the LD and RDF world? (Unless we want to go there already?)
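To make the idea concrete, here is a rough sketch of what such a machine-processable sidecar file could contain; the file name foo-csv-column-description.json comes from the comment above, and the field layout (name/description/unit) is purely illustrative, not an agreed convention.

```python
# Illustrative only: writes a hypothetical foo-csv-column-description.json
# "meaning file" next to foo.csv. The schema (name/description/unit) is a
# sketch, not an agreed convention.
import json

column_description = {
    "file": "foo.csv",
    "columns": [
        {"name": "Run", "description": "Run number", "unit": None},
        {"name": "Event", "description": "Event number", "unit": None},
        {"name": "E1", "description": "Energy of the first particle", "unit": "GeV"},
        {"name": "pt1", "description": "Transverse momentum of the first particle", "unit": "GeV"},
        # ... one entry per remaining column
    ],
}

with open("foo-csv-column-description.json", "w") as f:
    json.dump(column_description, f, indent=2)
```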

In any case, it would be great to use the same technique for ATLAS and CMS records, so that we can use uniform tools to manage/show CSV column descriptions for all our records.

tiborsimko · Jan 27 '15 13:01

I just checked examples from PANGAEA: they just label the files and include the parameters in the metadata, which is then linked to the ISO standards they use. So the real meaning lives outside the file. The GenBank examples likewise don't define much; they just label lines and rows and either expect the user to know what they mean or provide glossaries where this can be looked up. Depending on the information, it can go in a glossary or in the metadata, and once we have our first more complex case we will hopefully have made some progress with RDF etc. Does that sound feasible?

pherterich · Jan 28 '15 12:01

Thanks for checking! I'd then vote for keeping the current policy, i.e. storing brief column descriptions in the records in the MARC tag 505, as we did for ATLAS. (We can eventually refactor later.)

tiborsimko · Jan 28 '15 13:01

@tpmccauley Can you please take care of describing the CSV columns?

tiborsimko · Jul 16 '15 15:07

@tiborsimko @pherterich

I describe the csv fields here: https://github.com/cernopendata/opendata.cern.ch/blob/master/invenio_opendata/base/templates/visualise_histograms.html#L199

so that they show up when one hovers over the parameter button in the histogram application: http://opendatadev.cern.ch/visualise/histograms/CMS

Is this sufficient?

tpmccauley · Mar 15 '16 16:03

The templates may change, so I think it's better to put the description in the records themselves, as mentioned in https://github.com/cernopendata/opendata.cern.ch/issues/728#issuecomment-71651197

Perhaps @AnxhelaDani can prepare a metadata update PR based on your template text?

tiborsimko · Mar 18 '16 13:03

@tiborsimko @ArtemisLav @tpmccauley I've added this to the fast lane; I think it should be quite easy to implement (i.e. add the description to the record). If not, feel free to change it.

katilp · May 24 '17 20:05

@ArtemisLav Can you please create the corresponding MARC tag 505 with $t and $g subfields?
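For anyone not familiar with MARC, below is a hedged sketch of what one such entry could look like in MARCXML, built with Python's standard library; the column name and wording are placeholders, not the agreed record text.

```python
# Illustrative only: builds one MARCXML 505 datafield with a $t (title) and a
# $g (miscellaneous information) subfield for a single CSV column. The actual
# wording used in the records is up to the record editors.
import xml.etree.ElementTree as ET

datafield = ET.Element("datafield", {"tag": "505", "ind1": " ", "ind2": " "})
for code, text in [("t", "E1"), ("g", "Energy of the first muon (GeV)")]:
    subfield = ET.SubElement(datafield, "subfield", {"code": code})
    subfield.text = text

print(ET.tostring(datafield, encoding="unicode"))
# <datafield tag="505" ind1=" " ind2=" "><subfield code="t">E1</subfield>...
```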

tiborsimko · Jun 12 '17 09:06

@tpmccauley What about the csv files with px, py and pz and such? Or do all the csv files currently on the portal come with pt, eta and phi only (i.e. without the momentum components given separately)? Thanks!

katilp · Jun 12 '17 09:06

@ArtemisLav Here is an overview of all the CMS CSV file headers:

==> 4lepton.csv <==
Event,Run,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,E3,px3,py3,pz3,pt3,eta3,phi3,Q3,E4,px4,py4,pz4,pt4,eta4,phi4,Q4,M

==> dielectron100k.csv <==
Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> dielectron-Jpsi.csv <==
Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> dielectron-Upsilon.csv <==
Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> dimuon100k.csv <==
Type,Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> dimuon-Jpsi.csv <==
Type,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> diphoton.csv <==
Run,Event,pt1,eta1,phi1,pt2,eta2,phi2,M

==> masterclass_10-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_10-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_10-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_11-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_11-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_11-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_12-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_12-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_12-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_13-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_13-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_13-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_14-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_14-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_14-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_15-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_15-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_15-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_16-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_16-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_16-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_17-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_17-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_17-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_18-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_18-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_18-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_19-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_19-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_19-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_1-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_1-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_1-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_2-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_2-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_2-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_3-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_3-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_3-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_4-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_4-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_4-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_5-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_5-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_5-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_6-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_6-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_6-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_7-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_7-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_7-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_8-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_8-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_8-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_9-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_9-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_9-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_0.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_1.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_2.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_3.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_4.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_5.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_6.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_7.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_8.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_9.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Wenu.csv <==
Run,Event,E,px,py,pz,pt,eta,phi,Q,MET,phiMET

==> Wmunu.csv <==
Run,Event,E,px,py,pz,pt,eta,phi,Q,MET,phiMET

==> Zee.csv <==
Type,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Zmumu.csv <==
Type,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
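Taken together, these headers use only a small set of column names. Below is a hedged glossary sketch of their standard collider-physics meanings (numbered variants such as E1 or px2 refer to the first, second, ... particle in the event); the exact wording and units to put in the records should be confirmed by CMS.

```python
# Hedged glossary sketch: standard collider-physics meanings of the column
# names appearing in the headers above. Exact wording/units for the records
# should be confirmed by CMS; this is not the official description.
COLUMN_GLOSSARY = {
    "Run":    "Run number",
    "Event":  "Event number",
    "Type":   "Type/category of the reconstructed particle",
    "E":      "Energy (GeV)",
    "px":     "x component of the momentum (GeV)",
    "py":     "y component of the momentum (GeV)",
    "pz":     "z component of the momentum (GeV)",
    "pt":     "Transverse momentum (GeV)",
    "eta":    "Pseudorapidity",
    "phi":    "Azimuthal angle (radians)",
    "Q":      "Electric charge",
    "M":      "Invariant mass of the selected particle pair (GeV)",
    "MET":    "Missing transverse energy (GeV)",
    "phiMET": "Azimuthal angle of the missing transverse energy (radians)",
}

def describe(column_name):
    """Look up a column, ignoring a trailing particle index (E1, px2, ...)."""
    return COLUMN_GLOSSARY[column_name.strip().rstrip("0123456789")]
```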

tiborsimko · Jun 12 '17 13:06

@katilp the fields in the csv files differ. For some there was an explicit request for E,px,py,pz to be included in addition to pt, eta, and phi.
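For readers wondering how the two variable sets relate: (pt, eta, phi) can be computed from (px, py, pz), so files carrying both are redundant but convenient. A minimal sketch of the standard relations (illustrative helper, not portal code):

```python
# Standard relations between Cartesian momentum components and the (pt, eta,
# phi) variables used in these files. Illustrative helper; assumes pt > 0.
import math

def to_pt_eta_phi(px, py, pz):
    pt = math.hypot(px, py)                     # transverse momentum
    p = math.sqrt(px * px + py * py + pz * pz)  # total momentum
    eta = 0.5 * math.log((p + pz) / (p - pz))   # pseudorapidity
    phi = math.atan2(py, px)                    # azimuthal angle
    return pt, eta, phi
```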

So is 505 the field one should use to describe the csv column contents?

tpmccauley · Jun 12 '17 13:06

@tpmccauley yes, that is the field (e.g. see here http://opendata.cern.ch/record/554/export/xm). Will you be taking care of that or should we?

ArtemisLav · Jun 12 '17 13:06

@ArtemisLav I'm not sure I have the time to do it for the already-included csv files but can do it for the new ones I am preparing.

tpmccauley · Jun 12 '17 13:06

@tpmccauley that would be great, thanks, and in the meantime, I can do the ones we already have.

ArtemisLav · Jun 12 '17 14:06

@ArtemisLav I will document the new csv files today. That information can be used for the older ones.

tpmccauley · Jun 13 '17 09:06

@tiborsimko I think this is now addressed through #2784. Most of the datasets in http://opendata-dev.cern.ch/search?page=2&size=20&q=&experiment=CMS&file_type=csv&subtype=Derived&type=Dataset# now have the dataset semantics with the description of the variables, apart from:

  • http://opendata-dev.cern.ch/record/5200
  • http://opendata-dev.cern.ch/record/300
  • http://opendata-dev.cern.ch/record/310

katilp · Aug 20 '20 12:08