proteomics-sample-metadata Proposal to restructure the specification

Hi @bigbio/collaborators and all SDRF-Proteomics contributors,

SDRF-Proteomics is growing, which is great news! More datasets are being deposited in ProteomeXchange, and new use cases that need support continue to emerge. The SDRF community is actively working on multiple specifications, templates, and annotation guidelines for different analytical methods and experimental designs, including metaproteomics, affinity proteomics, bacterial proteomics, crosslinking datasets, and more.

At the same time, we are receiving an increasing number of questions about metadata annotation—such as how to describe patient preconditions or how to annotate tumor size. While expanding the specification to accommodate these needs is important, we must ensure that the core specification remains maintainable and user-friendly. We also want to avoid the challenges of managing multiple overlapping GitHub PRs, branches, and forks against a single document, which can make integration difficult.

To address these challenges, I propose the following structured approach for the SDRF-Proteomics specification (Figure 1):

1. Main SDRF-Proteomics Specification

This document defines the purpose of the format, the supported ontologies, the column structure, and the core rules governing SDRF-Proteomics. It also includes the main templates associated with it.

2. Guidelines for Specific Use Cases

These guidelines define the structure, rules, and standardization for specific use cases, analytical methods, and experimental designs. For example, we are developing SDRF-Metaproteomics to standardize how metaproteomics datasets should be structured. These guidelines are independent of the main document but remain linked to it, with corresponding templates (e.g., Gut Metaproteomics).

3. Sample Metadata Guidelines

These guidelines are not tied to a specific experiment type or analytical method; instead, they provide a standardized approach to capturing metadata for different sample types. For instance, we could define best practices for recording patient preconditions across all human disease samples in SDRF. These guidelines serve as recommendations and tutorials rather than mandatory specification rules.

Versioning Proposal

To maintain clarity in updates, I propose the following versioning approach:

Major release X.0.0 → X+1.0.0: When we introduce changes to the Main SDRF-Proteomics Specification (1).
Minor release X.Y.0 → X.Y+1.0: When we update or add new Guidelines for Specific Use Cases (2).
Patch release X.Y.Z → X.Y.Z+1: When we refine Sample Metadata Guidelines (3) or make small updates to existing guidelines.
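
To make the three rules above concrete, here is a minimal sketch of how a release script might apply them. The names `Change` and `bump` are hypothetical and not part of any existing SDRF tooling. It also assumes that a major release increases the first number by one and that lower components reset to zero, as in semantic versioning; the proposal does not state either explicitly.

```python
from enum import Enum


class Change(Enum):
    """Kind of change, following the three document levels in the proposal."""
    CORE_SPEC = 1           # (1) Main SDRF-Proteomics Specification
    USE_CASE_GUIDELINE = 2  # (2) Guidelines for Specific Use Cases
    SAMPLE_GUIDELINE = 3    # (3) Sample Metadata Guidelines or small updates


def bump(version: str, change: Change) -> str:
    """Return the next version string for a given kind of change.

    Assumption: lower components reset to 0 on a higher-level bump.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change is Change.CORE_SPEC:
        return f"{major + 1}.0.0"
    if change is Change.USE_CASE_GUIDELINE:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

For example, adding a new use-case guideline to a hypothetical release 1.2.3 would call `bump("1.2.3", Change.USE_CASE_GUIDELINE)`.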

This structured approach ensures that SDRF-Proteomics remains flexible, scalable, and easy to maintain while supporting the growing needs of the community.

Looking forward to your thoughts and feedback!

Mar 14 '25 12:03 ypriverol

Sounds like a good idea!

As part of this process it may be worth considering steps to ensure that all annotations here (and perhaps on PX?) remain valid or get updated when the specification is updated. This seems like it would require a clear set of rules to quickly determine which guidelines apply to which SDRF files, programmatically. Sounds tedious but could be useful.
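
Just to illustrate what I mean by "programmatically", a rough sketch: each guideline declares the columns that trigger it, and a file is matched by its header alone. The rule table and the trigger columns below are made up for illustration and are not taken from the specification or from any template.

```python
from typing import Dict, List, Set

# Hypothetical rule table: guideline name -> columns whose presence triggers it.
# These trigger columns are illustrative assumptions, not specification rules.
GUIDELINE_RULES: Dict[str, Set[str]] = {
    "crosslinking": {"comment[cross-linker]"},
    "human-disease": {"characteristics[organism]", "characteristics[disease]"},
}


def header_columns(sdrf_text: str) -> Set[str]:
    """Return the normalized column names from the first (tab-separated) line."""
    lines = sdrf_text.splitlines()
    first_line = lines[0] if lines else ""
    return {col.strip().lower() for col in first_line.split("\t") if col.strip()}


def applicable_guidelines(sdrf_text: str,
                          rules: Dict[str, Set[str]] = GUIDELINE_RULES) -> List[str]:
    """List the guidelines whose trigger columns are all present in the header."""
    columns = header_columns(sdrf_text)
    return sorted(name for name, required in rules.items() if required <= columns)
```

A real rule set would probably need more than column presence (for example, values in specific columns), but the header check alone is cheap enough to run over every deposited file.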

Then we could also think about validators implementing the corresponding "pluggable" logic for specific use cases, i.e. a machine-readable version of the specification, for example as pydantic rules. However, that's probably a topic for a separate discussion.
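
As a sketch of the "pluggable" idea, something like the following registry could work. It uses plain Python functions instead of pydantic so it has no dependencies, and it is not how the current validator is organized. The guideline names and the individual checks are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]                # one SDRF row: column name -> value
Check = Callable[[Row], List[str]]  # a check returns a list of error messages

# Registry of checks, grouped by the guideline they belong to.
VALIDATORS: Dict[str, List[Check]] = {}


def register(guideline: str) -> Callable[[Check], Check]:
    """Decorator that plugs a check into the registry under a guideline name."""
    def decorator(check: Check) -> Check:
        VALIDATORS.setdefault(guideline, []).append(check)
        return check
    return decorator


@register("core")
def organism_present(row: Row) -> List[str]:
    if not row.get("characteristics[organism]", "").strip():
        return ["characteristics[organism] is missing or empty"]
    return []


@register("human-disease")  # hypothetical guideline name
def disease_present(row: Row) -> List[str]:
    if not row.get("characteristics[disease]", "").strip():
        return ["characteristics[disease] is missing or empty"]
    return []


def validate(row: Row, guidelines: List[str]) -> List[str]:
    """Run only the checks registered for the selected guidelines."""
    errors: List[str] = []
    for guideline in guidelines:
        for check in VALIDATORS.get(guideline, []):
            errors.extend(check(row))
    return errors
```

Each guideline could then ship its own checks in a separate module, and the validator would only run the ones that apply to a given file.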

If we do manage to split it into well-defined chunks, we can also consider whether it would be beneficial to version them independently. I feel like most of them would almost never change after being added, though.

Mar 14 '25 15:03 levitsky

I think some people have been interested in the past in moving the validator to pydantic rules (https://github.com/bigbio/sdrf-pipelines/issues/159); as you said, @levitsky, this is a different discussion. I don't know how much has been done, but if you are interested we can start the discussion and try to find a student, or we can do it all together in a hackathon. But it's a good point: as the number of rules and templates continues to grow, a programmatic way to handle the rules is needed.

Mar 14 '25 15:03 ypriverol