qca-dataset-submission
What metadata should we be including inside our QCFractal submissions?
It would be helpful to include more metadata describing the construction of our datasets. Our input directories contain helpful blocks like this:
### General Information
- Date: 2019-07-21
- Class: Forcefield Parametrization
- Purpose: Explore discrepancies between QM and OPLS3e
- Collection: OptimizationDataset
- Name: Pfizer discrepancy optimization dataset 1
- Number of Entries: 100 unique molecules, XXX conformers
- Submitter: John Chodera
Should we also include this information in metadata for submission? If so, where should we put it?
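If the answer is yes, one low-effort option is to read the fields directly from the README block rather than retyping them. The following is only a sketch: `parse_general_information` and the snake_case key names are hypothetical, not part of QCFractal, and where the resulting dict should be attached on submission is still the open question.

```python
import re

# The README block quoted above, reproduced verbatim.
README_BLOCK = """\
### General Information
- Date: 2019-07-21
- Class: Forcefield Parametrization
- Purpose: Explore discrepancies between QM and OPLS3e
- Collection: OptimizationDataset
- Name: Pfizer discrepancy optimization dataset 1
- Number of Entries: 100 unique molecules, XXX conformers
- Submitter: John Chodera
"""


def parse_general_information(text):
    """Collect '- Key: value' bullets into a dict with snake_case keys.

    Hypothetical helper: lines that are not bullets (such as the heading)
    are ignored, and values are kept as plain strings.
    """
    metadata = {}
    for line in text.splitlines():
        match = re.match(r"^-\s*([^:]+):\s*(.*)$", line.strip())
        if match:
            key = match.group(1).strip().lower().replace(" ", "_")
            metadata[key] = match.group(2).strip()
    return metadata


metadata = parse_general_information(README_BLOCK)
```

Values stay as strings on purpose, so placeholders such as `XXX conformers` pass through unchanged instead of being guessed at.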
A few other items:
- "tagline" would also be very useful. A single sentence of ~1-200 characters that describes the dataset. This could be the "Purpose" field above, but may be different.
- "tags" for the dataset. One should certainly be "openff"; perhaps others like "biomolecule" or "force field"? We are still working out exactly what these should be, so others' thoughts on tags would be useful to us.
> "tagline" would also be very useful. A single sentence of ~1-200 characters that describes the dataset. This could be the "Purpose" field above, but may be different.
Can you give some examples?
What about a description field that allows a more detailed description? (Perhaps that is more sensible than tagline?)
I believe we would ultimately like both. tagline or similar is useful for when you are presented with a dozen datasets where the name is not sufficiently informative. A description field could be a few paragraphs of additional information.
A few examples:
- X40: "Binding energies of noncovalent interactions involving halogenated molecules"
- Butanediol65: "Isomerization energies for butanediol"
- MPCONF196: "Conformation energies of acyclic and cyclic model peptides and several other macrocycles"

It might just be that quantum chemists give datasets arcane names, but a tagline seems to be useful here.
OK, so what is the complete set of metadata entries we want to include so far?
tag, tagline, description?
Sorry, should have been clearer: What's the full set of metadata we should manually specify then? Presumably, this includes things like a dataset name, levels of theory, person generating the data, contact info, etc.
You perhaps also want a "Source URL (if applicable)" tag or something like that, e.g. for stuff in this repo we'd link to where it came from. This also might help encourage people to... provide links to source materials.
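To make the list of candidate fields concrete, here is a rough sketch that gathers everything suggested so far into one container. All of it is an assumption for discussion: `DatasetMetadata`, its field names, and the 200-character tagline limit are hypothetical and do not correspond to any existing QCFractal API.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# Assumption: upper bound taken from the "~1-200 characters" tagline suggestion.
MAX_TAGLINE_LENGTH = 200


@dataclass
class DatasetMetadata:
    """Hypothetical container for manually specified dataset metadata."""

    name: str
    tagline: str                  # single sentence shown next to the name
    submitter: str
    description: str = ""         # longer free text, possibly a few paragraphs
    tags: List[str] = field(default_factory=list)
    levels_of_theory: List[str] = field(default_factory=list)
    source_url: Optional[str] = None  # "Source URL (if applicable)"

    def __post_init__(self):
        # Enforce the one-sentence, short-tagline idea from the thread.
        if "\n" in self.tagline or len(self.tagline) > MAX_TAGLINE_LENGTH:
            raise ValueError(
                f"tagline must be a single line of at most {MAX_TAGLINE_LENGTH} characters"
            )


# Example built only from values quoted earlier in this thread.
example = DatasetMetadata(
    name="Pfizer discrepancy optimization dataset 1",
    tagline="Explore discrepancies between QM and OPLS3e",
    submitter="John Chodera",
    tags=["openff"],
)
example_dict = asdict(example)
```

Keeping `description`, `levels_of_theory`, and `source_url` optional means a submission can start with just a name, tagline, and submitter and be filled in later; contact info could be added the same way once we agree on a format.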