Capturing metadata about the file itself
Since the release of version 1.0.0 of SDRF, we have been discussing how to capture metadata about the file itself. For example:
- SDRF version: This information is now quite relevant because, with this new version, we need to differentiate between versions of the format.
- SDRF Template: This will define which template was used to annotate the dataset.
- SDRF Template Version: This will define the version of the template.
Additional information could include the annotation software, its version, etc.
To serialise this information into SDRF, there are multiple possible approaches:
1 - Currently, for example, lesSDRF uses extra comment columns to annotate this information, but this repeats the same information on every row, mostly with only one value per column.
source name characteristics[organism] ... source name technology type ... comment[sdrf version] comment[sdrf template]
2 - Use the notation of some genomics formats such as VCF and put this information in comment lines, something like:
#sdrf version: 1.1.0
#sdrf template: human
#sdrf template version: 1.0.0
#annotation software: lesSDRF
source name characteristics[organism] ... source name technology type ...
3 - Use the same notation as before but at the bottom of the file, like:
source name characteristics[organism] ... source name technology type ...
#sdrf version: 1.1.0
#sdrf template: human
#sdrf template version: 1.0.0
#annotation software: lesSDRF
Solutions 2 and 3 are compatible with the Python ecosystem, including pandas, etc. Other solutions?
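To illustrate how options 2 and 3 could be read, here is a minimal sketch using only the Python standard library. It assumes that metadata lines start with `#` and use the `key: value` form shown above, and that the table is tab-separated; the function name `read_sdrf_with_comments` is hypothetical and not part of any existing SDRF tool. Rows are kept as lists because SDRF files can repeat column names.

```python
import csv

def read_sdrf_with_comments(text):
    """Separate '#key: value' lines from the tab-separated table.

    Works whether the comment lines are at the top (option 2)
    or at the bottom (option 3) of the file.
    """
    metadata = {}
    table_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Assumed syntax: '#key: value'
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            table_lines.append(line)
    rows = list(csv.reader(table_lines, delimiter="\t"))
    header, data = rows[0], rows[1:]
    return metadata, header, data

example = (
    "#sdrf version: 1.1.0\n"
    "#sdrf template: human\n"
    "source name\tcharacteristics[organism]\n"
    "sample 1\thomo sapiens\n"
)
metadata, header, data = read_sdrf_with_comments(example)
print(metadata)
print(header)
print(data)
```

With pandas, `read_csv(..., sep="\t", comment="#")` should skip such lines as well, but note that it also truncates any data line at the first `#`, so this only works if `#` never appears inside a cell.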
For tabular files, a # comment line is more common if we don't want to repeat information. I think comments at the beginning are also more common, so 2 would be OK. Would we ever change the number of lines that we use for these? We will need to document the expected syntax for this meta-metadata and potentially reserve keywords. What can an SDRF writer use to store specialized or internal data? I can think of a few uses for these extra fields.
Agreed. In the first iteration, we first want to know where it will fit. Then we can discuss in this same issue which information to capture and how to encode it properly, with the expected names of the fields.
Option 2 is also my preferred one. Some file formats use ## for machine-readable comments (e.g. VCF). Maybe something to consider... though not super important
##fileformat=VCFv4.2
Option 2 is more intuitive and efficient for most use cases.
As discussed during the second meeting, we agreed to support option 2. The format will use ## for comments. The first version will capture the following fields:
##fileformat=SDRF
##fileformat version=v1.1.0
##template=human
##template version=v.1.0.0
##source=lesSDRF
##validator certified={hash from SDRF validator}
More metadata could be captured as key=value pairs. We will add most of these keys to the PRIDE ontology. If you have comments, please let us know before we add the terms.
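A minimal sketch of how the `##key=value` lines could be parsed, assuming the metadata block sits at the top of the file and ends at the first line that does not start with `##`. The function name `parse_header_metadata` is hypothetical, and the accepted keys are not validated here since the final list is still open.

```python
def parse_header_metadata(lines):
    """Collect leading '##key=value' lines into a dict."""
    metadata = {}
    for line in lines:
        if not line.startswith("##"):
            break  # the metadata block ends at the first non-## line
        key, sep, value = line[2:].rstrip("\n").partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        metadata[key.strip()] = value.strip()
    return metadata

header_lines = [
    "##fileformat=SDRF",
    "##fileformat version=v1.1.0",
    "##template=human",
    "source name\tcharacteristics[organism]",
]
parsed = parse_header_metadata(header_lines)
print(parsed)
```

Splitting on the first `=` only means that values may themselves contain `=`, which could matter for something like a validator hash.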
I think it would be good to have a tool that allows downloading multiple templates or guidelines combined. Recommended @noatgnu
#fileformat=SDRF; version=v1.0.0
#template=human; version=v1.1.0
#template=vertebrate; version=v1.1.0
#guideline=immunopeptidomics; version=v1.1.0
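A minimal sketch of how this combined form could be parsed, assuming each line is `#kind=name` followed by `;`-separated `key=value` attributes. Because the same kind (e.g. `template`) can appear on several lines, the result is a list of records rather than a dict. The function name and the `kind`/`name` field names are hypothetical.

```python
def parse_versioned_line(line):
    """Parse '#kind=name; key=value; ...' into a flat record."""
    body = line.lstrip("#").strip()
    first, *attrs = [part.strip() for part in body.split(";")]
    kind, _, name = first.partition("=")
    record = {"kind": kind.strip(), "name": name.strip()}
    for attr in attrs:
        if attr:  # ignore empty parts, e.g. after a trailing ';'
            key, _, value = attr.partition("=")
            record[key.strip()] = value.strip()
    return record

lines = [
    "#fileformat=SDRF; version=v1.0.0",
    "#template=human; version=v1.1.0",
    "#template=vertebrate; version=v1.1.0",
]
records = [parse_versioned_line(line) for line in lines]
templates = [r["name"] for r in records if r["kind"] == "template"]
print(records)
print(templates)
```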