nnpdf icon indicating copy to clipboard operation
nnpdf copied to clipboard

[New commondata format] FTDY

Open scarlehoff opened this issue 2 years ago • 43 comments

I'm reopening #1610 here @tgiani This is just a rebase of your commits there.

These metadata.yaml files however contains some errors (and some things that we might want to change?)

  1. The npoints field in hepdata. Do we need this field @enocera ? (if so I'll add it to the reader)
  2. The theory field must always come with an operation even if it is NULL
  3. The FK_tables are actually a list of list.

So, the theory right now is:

theory:
   FK_tables:
     - DYE605

but it should be

theory:
   FK_tables:
     - - DYE605
   operation: "NULL"

(this is due to some FKTables being a concatenation of grids now)

Point 2 is a bit silly in hindsight but for now it is like that...

scarlehoff avatar Feb 23 '23 13:02 scarlehoff

FWIW I find

     - - DYE605

ugly. Can we at least make the examples read like

     - [DYE605]

?

More generally, I don't know if I have the appetite to reopen this, but I feel the theory metadata has deviated quite a bit from the "we would like experimentalist to be able to write this" goal that we had at some point. In addition the current definition has a bunch of backwards compatible and "transactional" semantics. I do wonder if we are better off having all that in a separate file, that would be less stable for the moment.

Zaharid avatar Feb 23 '23 14:02 Zaharid

I don't know if I have the appetite to reopen this, but I feel the theory metadata has deviated quite a bit from the "we would like experimentalist to be able to write this" goal that we had at some point

What do you mean? I don't think it has changed since we defined it in Milan one year ago?

In addition the current definition has a bunch of backwards compatible and "transactional" semantics

like?

scarlehoff avatar Feb 23 '23 14:02 scarlehoff

I don't know if I have the appetite to reopen this, but I feel the theory metadata has deviated quite a bit from the "we would like experimentalist to be able to write this" goal that we had at some point

What do you mean? I don't think it has changed since we defined it in Milan one year ago?

Probably not, but then again one year has passed.

In addition the current definition has a bunch of backwards compatible and "transactional" semantics

like?

https://github.com/NNPDF/nnpdf/pull/1678/files#diff-eb4ccea229f5af5e39ccb1cc2984780e879f823a50477da80cc65ff859717c8bR112-R115

Zaharid avatar Feb 23 '23 15:02 Zaharid

https://github.com/NNPDF/nnpdf/pull/1678/files#diff-eb4ccea229f5af5e39ccb1cc2984780e879f823a50477da80cc65ff859717c8bR112-R115

Ah, but only on the theory part. Not sure how transactional they will be though since we need them for the old theories, unless we completely regenerate the affected grids (specially the ones that skip points in the middle).

scarlehoff avatar Feb 23 '23 15:02 scarlehoff

  1. The npoints field in hepdata. Do we need this field @enocera ? (if so I'll add it to the reader)

This is not required and should be deleted.

enocera avatar Feb 23 '23 15:02 enocera

I've updated E605, E866P/R and E906 to follow the new format proposed by @t7phy

For the name convention I'm following what has been proposed by @cschwan, i.e.,

EXPERIMENT_PROCESS_ENERGY_OBSERVABLE_EXTRA

stopping at the point in which the information makes the dataset unique. In this case for E605 and E906, the commondata is under the E605/E906 folders. Instead for E866 there were two publications and so the data is under the folders E866_FTDY_800GEV_PXSEC and E866_FTDY_800GEV_PDXSECRATIO. Please @cschwan and @felixhekhorn have a look at the name convention and let me know whether it is correct.

I've also added two keys under implemented_observables:

The first is dataset which contains the full name of the dataset to be used in runcards. Note that it is at the top of the observable. This doesn't matter for yaml but it might be a nice way of separating the observables. I think this is useful since otherwise the user needs to "reconstruct" by hand the name which can lead to mistakes (cc @t7phy @enocera I think this would be useful to add in all new commondata).

The second is process, which is necessary in order for the name convention to work.

And one top-level key with energy (same, it is necessary for the naming convention).

I also think the file key under data_uncertainties and data_central should be removed since it is redundant (these two keys already refer to files).

@tgiani one question, you added also a folder DYE866P_old_kinematic but this is in yet a different format. Was this just a leftover? Should I remove it? Or is it a valid variant of DYE866P?

Please, let me know whether all this is correct and whether I can move forward with the reader using these files as target.

scarlehoff avatar Mar 23 '23 12:03 scarlehoff

I'd

  • ~always give at least the experiment and the process (maybe also the energy?). For E605 there are also datasets in which the muon-pair comes from an upsilon resonance, so specifying the process makes it unique again;~
  • use DY instead FTDY because the process part is supposed to specify the final state and the experiment/variant usually specify the initial states. Drell-Yan at Tevatron (ppbar) we also call Drell-Yan.

cschwan avatar Mar 23 '23 12:03 cschwan

@scarlehoff if you have a look at https://github.com/NNPDF/nnpdf/blob/more_efficient_metadata_for_new_commondata/buildmaster/CMS_ttBar_8TeV_lj_dif/metadata.yaml#L3 and also have a look at line 23, 48, 73, etc. which contain the observable_name, if you were to concat the line 3 setname with observable_name, you can uniquely identify every observable. In the above example, I will have to capitalize them ofc. In your example, the observable_name would just be empty as the implementation is only with a single observable which is uniquely identified by the setname itself. I would therefore suggest to drop the dataset key in every implemented observable, which is redundant (due to setname). I would also suggest dropping of process and energy from every single implemented observable, because again, the setname at the top should explain everything. I do not think it would be a good to over-complicate the reader by requiring to to read so many different keys just to identify a single variable. Also, the user is expected to be able to construct the dataset name without making errors because how else do you name the directory where the implementation will be made in the first place? Concerning the file key, I guess it helps main consistency with the kinematics part of the metadata block, but I don't have any strong preference there.

t7phy avatar Mar 23 '23 13:03 t7phy

I do not think it would be a good to over-complicate the reader by requiring to to read so many different keys just to identify a single variable.

Actually for the reader I would very much like to have all fields: experiment, energy, process and observable. And 4 of these 3 are required anyway (the process is needed for MHOU). Energy is new but I'm guessing it will be necessary in all LHC datasets.

Of course, we can say that the minimum name for the commondata is EXPERIMENT_ENERGY_PROCESS* and then it is not necessary to have it in the metadata. That will also be fine. Can we have one paper with two different processes/energies?

*with extra information being added in a case-by-case basis

scarlehoff avatar Mar 23 '23 13:03 scarlehoff

I do not think it would be a good to over-complicate the reader by requiring to to read so many different keys just to identify a single variable.

Actually for the reader I would very much like to have all fields: experiment, energy, process and observable. And 4 of these 3 are required anyway (the process is needed for MHOU). Energy is new but I'm guessing it will be necessary in all LHC datasets.

Of course, we can say that the minimum name for the commondata is EXPERIMENT_ENERGY_PROCESS and then it is not necessary to have it in the metadata. That will also be fine. Can we have one paper with two different processes/energies?

Actually I have implemented both cases. Papers with different processes and papers with different energies. For example H1 datasets containing jets and dijets, or LHC papers containing both 7 and 8 TeV datsets. In these cases we separate the implementations for the processes or energies (suggest by @enocera), so the setname+observable_name can be used for unique identification without any issues.

t7phy avatar Mar 23 '23 13:03 t7phy

So I should rename:

E605 -> E605_DY_38P8GEV E906 -> E906_DY_120GEV

And if one wants to add another process from E605 then that will be a separate folder even if it happens to be the same paper.

Did I understand it correctly?

scarlehoff avatar Mar 23 '23 13:03 scarlehoff

So I should rename:

E605 -> E605_DY_38P8GEV E906 -> E906_DY_120GEV

And if one wants to add another process from E605 then that will be a separate folder even if it happens to be the same paper.

Did I understand it correctly?

Si

t7phy avatar Mar 23 '23 13:03 t7phy

I would say even the dataset key in every observable is redundant. All you need is the global key: setname, and the observable specific key: observable_name for unique identification

t7phy avatar Mar 23 '23 13:03 t7phy

Ok, I've updated now with all the suggested changes.

I've left the dataset key since in the current naming convention it is necessary due to the following:

In general the name of the dataset will be: _ and is EXPERIMENT_PROCESS_ENERGY, so the full name is EXPERIMENT_PROCESS_ENERGY_OBSERVABLE

However, in E866 the folder also includes the observable name, therefore the observable would get added twice. Note that this is also a problem currently in #1684 which is called CMS_TTBAR_8TEV_LJ_DIFF where under this convention LJ_DIFF would be the observable however the actual observables still need to concatenate things at the end: CMS_TTBAR_8TEV_LJ_DIFF_PT, etc

What is the proposed solution for this? As I've said before I would like to avoid having keys that appear in some metadata files and not in others regarding the names.

(I don't have any strong opinion regarding what the correct way should be, I just want to know for sure that I have only one rule to create the name and that I can apply the same rule to all commondata).

scarlehoff avatar Mar 23 '23 13:03 scarlehoff

Ok, I've updated now with all the suggested changes.

I've left the dataset key since in the current naming convention it is necessary due to the following:

In general the name of the dataset will be: _ and is EXPERIMENT_PROCESS_ENERGY, so the full name is EXPERIMENT_PROCESS_ENERGY_OBSERVABLE

However, in E866 the folder also includes the observable name, therefore the observable would get added twice. Note that this is also a problem currently in #1684 which is called CMS_TTBAR_8TEV_LJ_DIFF where under this convention LJ_DIFF would be the observable however the actual observables still need to concatenate things at the end: CMS_TTBAR_8TEV_LJ_DIFF_PT, etc

What is the proposed solution for this? As I've said before I would like to avoid having keys that appear in some metadata files and not in others regarding the names.

(I don't have any strong opinion regarding what the correct way should be, I just want to know for sure that I have only one rule to create the name and that I can apply the same rule to all commondata).

As I wrote above, for the E866, the simplest solution is to just have an empty value for observable_name as the setname itself uniquely identifies it. As for the convention, it is EXPERIMENT_PROCESS_ENERGY_MISC.INFO_OBSERVABLENAME, this misc info is important as there are datasets in different papers with slight variations (luminosity for instance). The reader could get the exp name by looking at the what comes before 1st _, process before 2nd _ and so on, however the observable name could be obtained by looking at what comes after the last _. I will need to make lots of changes too, because I usually named observables as M_TTBAR or Y_TTBAR which I will modify to MTTBAR and YTTBAR to ensure this convention. Does this seem fine?

t7phy avatar Mar 23 '23 13:03 t7phy

@tgiani one question, you added also a folder DYE866P_old_kinematic but this is in yet a different format. Was this just a leftover? Should I remove it? Or is it a valid variant of DYE866P?

Please, let me know whether all this is correct and whether I can move forward with the reader using these files as target.

@scarlehoff that was implemented to cross-check some old implementation of the data. I added it for completeness as asked by @enocera, not sure if we want to keep it

tgiani avatar Mar 23 '23 14:03 tgiani

Ok, that will basically leave the following convention:

<EXPERIMENT>_<PROCESS>_<ENERGY>_<FREE FIELDS>_<OBSERVABLENAME>

where OBSERVABLENAME is always included in the metadata.yaml and if necessary (case of E866) left empty. The free fields can separate only at the topmost level but I think that's ok, basically any possible ambiguity is "casted" into a separation of folders.

I'm happy with it :D

If @cschwan and @felixhekhorn (also @enocera ofc but I'm assuming he's also ok with it) confirm I will update the datasets here and add it to the docs when dealing with #1691.

scarlehoff avatar Mar 23 '23 14:03 scarlehoff

@t7phy please have a look at the current version, if everything is ok I'll use these as the "examples" for the reader.

@enocera do you want me to keep DYE866P_old_kinematic or can I just remove it?

scarlehoff avatar Mar 24 '23 07:03 scarlehoff

@t7phy please have a look at the current version, if everything is ok I'll use these as the "examples" for the reader.

@enocera do you want me to keep DYE866P_old_kinematic or can I just remove it?

I think it's good to go.

t7phy avatar Mar 24 '23 08:03 t7phy

@t7phy please have a look at the current version, if everything is ok I'll use these as the "examples" for the reader.

@enocera do you want me to keep DYE866P_old_kinematic or can I just remove it?

@scarlehoff Let me check what that was about before dropping it, please.

enocera avatar Mar 24 '23 08:03 enocera

While one can reconstruct the whole dataset name from the information contained in the foldername + observable... it is impossible to know what the foldername is to be if the observable can be included in the name.

I think we need a different rule for cases la E866 where we have: E866_DY_800GEV_PXSEC and E866_DY_800GEV_PDXSECRATIO

but one could be evil and add as well

E866_DY_800GEV

which has inside a PXSEC observable.

Both those cases lead to a dataset called E866_DY_800GEV_PXSEC.

I would prefer if we have a convention in which such an ambiguity cannot be generated.

One possibility is not allowing the observable to be put in the name. If one needs to disentangle two dataset then some extra needs to be used.

scarlehoff avatar Mar 29 '23 13:03 scarlehoff

Also, instead of

implemented_observables:
  - observable_name: "ob1"
    ...
  - observable_name: "ob2"
    ...

I think it would be better:

implemented_observables:
    obs1:
        data: whatever.dat
        plotting: whatever
    obs2:
       ....

So that one doesn't need to load all implemented observable in order to look for the desired observable but can rather do implemented_observables[observable_name].

Another question is whether we want to allow for badly constructed observable not to break the good one. So I only load the information for the observable that I actually want so that if obs2 is broken then as long as I don't ask for obs2 everything is fine. Or whether instead we want to crash as soon as one observable is not ok.

scarlehoff avatar Mar 29 '23 13:03 scarlehoff

While one can reconstruct the whole dataset name from the information contained in the foldername + observable... it is impossible to know what the foldername is to be if the observable can be included in the name.

I think we need a different rule for cases la E866 where we have: E866_DY_800GEV_PXSEC and E866_DY_800GEV_PDXSECRATIO

but one could be evil and add as well

E866_DY_800GEV

which has inside a PXSEC observable.

Both those cases lead to a dataset called E866_DY_800GEV_PXSEC.

I would prefer if we have a convention in which such an ambiguity cannot be generated.

One possibility is not allowing the observable to be put in the name. If one needs to disentangle two dataset then some extra needs to be used.

I think there is a very simple solution. Remember how we had created a dictionary for backward compatibility which was basically for older datasets which provide a folder name and and observable name. We could use this dictionary for these special cases too, although here it would work in the reverse order as follows: it searches for E866_DY_800GEV folder because of E866_DY_800GEV_PXSEC name, which it doesn't find and then it searches in the dictionary. In the dictionary, it is pointed to E866_DY_800GEV_PXSEC folder and an empty name for the observable. This would work for both E866_DY_800GEV_PXSEC and E866_DY_800GEV_PDXSECRATIO if you have a look at the structure of the dictionary file in the https://github.com/NNPDF/nnpdf/blob/more_efficient_metadata_for_new_commondata/buildmaster/backward_compatibility.yaml

t7phy avatar Mar 29 '23 13:03 t7phy

The problem with that solution is that it adds a third way of loading the data. I would prefer to have one single way of reading the data. Given a well constructed dataset name I should know exactly where to go to read the data (the old names are not "well constructed" but we cannot do anything about that).

The compatibility dictionary should be limited to the specific case of reading old fits and not be used for anything else. If we want to use a dictionary to tell vp where to load the data from, we can do that, but then all datasets should be read from there (but I don't agree with this).

scarlehoff avatar Mar 29 '23 13:03 scarlehoff

Also, instead of

implemented_observables:
  - observable_name: "ob1"
    ...
  - observable_name: "ob2"
    ...

I think it would be better:

implemented_observables:
    obs1:
        data: whatever.dat
        plotting: whatever
    obs2:
       ....

I don't have any opinion on this, if you think this would be better, that's ok with me.

So that one doesn't need to load all implemented observable in order to look for the desired observable but can rather do implemented_observables[observable_name].

Another question is whether we want to allow for badly constructed observable not to break the good one. So I only load the information for the observable that I actually want so that if obs2 is broken then as long as I don't ask for obs2 everything is fine. Or whether instead we want to crash as soon as one observable is not ok.

I guess better to load what is needed only. An important idea of this new metadata format is to be able to implement other observables whenever we want, however it would be bad if a new implementation breaks something already implemented.

t7phy avatar Mar 29 '23 13:03 t7phy

The problem with that solution is that it adds a third way of loading the data. I would prefer to have one single way of reading the data. Given a well constructed dataset name I should know exactly where to go to read the data (the old names are not "well constructed" but we cannot do anything about that).

The compatibility dictionary should be limited to the specific case of reading old fits and not be used for anything else. If we want to use a dictionary to tell vp where to load the data from, we can do that, but then all datasets should be read from there (but I don't agree with this).

Although I agree that in principal we should have a consistent way we read them, as we know, this particular set is an exception so I guess we could make an exception for this case (and if so any comes in the future where we use the dictionary, which is actually meant for something else, to help here). I am not sure of the problem here. As you mentioned, the dictionary is to help with improper names, which is indeed what the case is here because of how the dataset is provided to us in the first place.

t7phy avatar Mar 29 '23 13:03 t7phy

While this problem is exposed by this specific dataset, the ambiguity is completely general. I would prefer one consistent way that works for all datasets we already have.

The whole idea of this new metadata format is to be able to implement other observables whenever we want, however it would be bad if a new implementation breaks something already implemented.

I guess the advantage of being strict is that we ensure that we are not using commondata with broken observables (which is probably a good thing)

scarlehoff avatar Mar 29 '23 13:03 scarlehoff

(on the other hand, by being strict the whole commondata is loaded and then it doesn't matter whether we choose the list or the dictionary approach... so I guess I'll go for the strict checking so that the datasets which are already implemented are ok)

scarlehoff avatar Mar 29 '23 13:03 scarlehoff

While this problem is exposed by this specific dataset, the ambiguity is completely general. I would prefer one consistent way that works for all datasets we already have.

Well then the only other option I see is the option I had mentioned the other day in case of this dataset. We all the references to be a list, i.e. multiple arxiv, inspire and hepdata sources. This way if a dataset is provided to us from multiple sources even though it is for the same experiment/process/etc., it can still be in the same folder. Then PXSEC and PDXSECRATIO become observable names as we would like them to be. The ambiguity goes away because then we have a strict convention that there must be observable name, and it can't be left empty.

The whole idea of this new metadata format is to be able to implement other observables whenever we want, however it would be bad if a new implementation breaks something already implemented.

I guess the advantage of being strict is that we ensure that we are not using commondata with broken observables (which is probably a good thing)

Well technically, the new implementations would go through PR reviews and not really be broken unless someone merges without review :P but fine

t7phy avatar Mar 29 '23 13:03 t7phy

But at that point, wouldn't it be easier to be strict with the convention, i.e., <folder_name>_<observable> and in this case and other which might arise add _extras. Would that be fine?

scarlehoff avatar Mar 29 '23 13:03 scarlehoff