qca-dataset-submission

What metadata should we be including inside our QCFractal submissions?

Open jchodera opened this issue 6 years ago • 8 comments

It would be helpful to include more metadata describing the construction of our datasets. Our input directories contain helpful blocks like this:

### General Information

 - Date: 2019-07-21
 - Class: Forcefield Parametrization
 - Purpose: Explore discrepancies between QM and OPLS3e
 - Collection: OptimizationDataset
 - Name: Pfizer discrepancy optimization dataset 1
 - Number of Entries: 100 unique molecules, XXX conformers
 - Submitter: John Chodera

Should we also include this information in metadata for submission? If so, where should we put it?

jchodera avatar Sep 07 '19 17:09 jchodera
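
One way to reuse the "General Information" block as submission metadata would be to parse it into key/value pairs. The following is only a minimal sketch in plain Python: `parse_general_information` is a hypothetical helper, not part of QCFractal, and where the resulting dictionary would be attached to a submission is left open, as in the question above.

```python
import re

# The README block quoted above, copied verbatim (including the "XXX" placeholder).
README_BLOCK = """\
### General Information

 - Date: 2019-07-21
 - Class: Forcefield Parametrization
 - Purpose: Explore discrepancies between QM and OPLS3e
 - Collection: OptimizationDataset
 - Name: Pfizer discrepancy optimization dataset 1
 - Number of Entries: 100 unique molecules, XXX conformers
 - Submitter: John Chodera
"""


def parse_general_information(text):
    """Hypothetical helper: turn ' - Key: value' lines into a dict.

    Keys are lower-cased and spaces become underscores; values stay strings.
    Lines that are not list items (such as the heading) are ignored.
    """
    metadata = {}
    for line in text.splitlines():
        match = re.match(r"^\s*-\s*([^:]+):\s*(.*)$", line)
        if match:
            key = match.group(1).strip().lower().replace(" ", "_")
            metadata[key] = match.group(2).strip()
    return metadata


metadata = parse_general_information(README_BLOCK)
```

Here `metadata["submitter"]` holds the submitter string and `metadata["number_of_entries"]` keeps the free-text entry count unchanged.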

A few other items:

  • "tagline" would also be very useful. A single sentence of ~1-200 characters that describes the dataset. This could be the "Purpose" field above, but may be different.
  • "tags" for the dataset, one should certainly be "openff". Perhaps others like "biomolecule" or "force field"? We are still playing with exactly what these are, but others' thoughts on tags would be useful to us.

dgasmith avatar Sep 18 '19 13:09 dgasmith

> "tagline" would also be very useful. A single sentence of ~1-200 characters that describes the dataset. This could be the "Purpose" field above, but may be different.

Can you give some examples?

jchodera avatar Sep 18 '19 13:09 jchodera

What about a description field that allows a more detailed description? (Perhaps that is more sensible than tagline?)

jchodera avatar Sep 18 '19 13:09 jchodera

I believe we would ultimately like both. tagline or similar is useful for when you are presented with a dozen datasets where the name is not sufficiently informative. A description field could be a few paragraphs of additional information.

A few examples:

  • X40 - "Binding energies of noncovalent interactions involving halogenated molecules"
  • Butanediol65 - "Isomerization energies for butanediol"
  • MPCONF196 - "Conformation energies of acyclic and cyclic model peptides and several other macrocycles"

It might be that quantum chemists have arcane names, but it seems to be useful.

dgasmith avatar Sep 18 '19 13:09 dgasmith

OK, so what is the complete set of metadata entries we want to include so far?

jchodera avatar Sep 18 '19 13:09 jchodera

tag, tagline, description?

dgasmith avatar Sep 18 '19 13:09 dgasmith

Sorry, should have been clearer: What's the full set of metadata we should manually specify then? Presumably, this includes things like a dataset name, levels of theory, person generating the data, contact info, etc.

jchodera avatar Sep 18 '19 15:09 jchodera

You perhaps also want a "Source URL (if applicable)" tag or something like that, e.g. for stuff in this repo we'd link to where it came from. This also might help encourage people to... provide links to source materials.

davidlmobley avatar Sep 18 '19 16:09 davidlmobley
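
The fields proposed so far in this thread (name, tagline, description, tags, submitter, source URL) could be collected in a small record type. This is a sketch under stated assumptions rather than an agreed schema: the class `DatasetMetadata`, its field names, and the 200-character cap (one reading of "~1-200 characters") are all hypothetical, and nothing here uses or reflects the QCFractal API.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# Assumed upper bound for a tagline, taken from the "~1-200 characters" suggestion.
MAX_TAGLINE_LENGTH = 200


@dataclass
class DatasetMetadata:
    """Hypothetical container for the metadata fields discussed in this thread."""

    name: str
    tagline: str                      # one short sentence describing the dataset
    description: str = ""             # longer free text, possibly a few paragraphs
    tags: List[str] = field(default_factory=list)
    submitter: str = ""
    source_url: Optional[str] = None  # "Source URL (if applicable)"

    def __post_init__(self):
        # Reject taglines longer than the assumed cap.
        if len(self.tagline) > MAX_TAGLINE_LENGTH:
            raise ValueError(
                f"tagline is {len(self.tagline)} characters; "
                f"expected at most {MAX_TAGLINE_LENGTH}"
            )


example = DatasetMetadata(
    name="Pfizer discrepancy optimization dataset 1",
    tagline="Explore discrepancies between QM and OPLS3e",
    tags=["openff"],
    submitter="John Chodera",
)

# A plain dict is easy to serialize alongside a submission.
metadata_dict = asdict(example)
```

Other items raised above, such as levels of theory and contact info, could be added as further fields once their format is settled.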