proteomics-sample-metadata

Assay name column should be unique

Open noatgnu opened this issue 3 months ago • 16 comments

Update the specification with guidance on naming syntax for the assay name column to ensure that all values within the column are unique.

Update the examples and annotated SDRF files so that the assay name column contains only unique values.

noatgnu avatar Sep 03 '25 10:09 noatgnu

Below is the list of SDRF files in the annotated projects folder that need to be updated to be compliant:

PMID33212010/PMID33212010.sdrf.tsv
PXD000070/PXD000070.sdrf.tsv
PXD000228/PXD000228.sdrf.tsv
PXD000396/PXD000396.sdrf.tsv
PXD000527/PXD000527.sdrf.tsv
PXD000534/PXD000534.sdrf.tsv
PXD000547/PXD000547.sdrf.tsv
PXD000548/PXD000548.sdrf.tsv
PXD000651/PXD000651.sdrf.tsv
PXD000652/PXD000652.sdrf.tsv
PXD000759/PXD000759.sdrf.tsv
PXD000793/PXD000793.sdrf.tsv
PXD000815/PXD000815.sdrf.tsv
PXD000857/PXD000857.sdrf.tsv
PXD000895/PXD000895.sdrf.tsv
PXD000923/PXD000923.sdrf.tsv
PXD000999/PXD000999.sdrf.tsv
PXD001061/PXD001061.sdrf.tsv
PXD001168/PXD001168.sdrf.tsv
PXD001224/PXD001224.sdrf.tsv
PXD001381/PXD001281.sdrf.tsv
PXD001468/PXD001468.sdrf.tsv
PXD001487/PXD001487.sdrf.tsv
PXD001558/PXD001558.sdrf.tsv
PXD001684/PXD001684.sdrf.tsv
PXD001774/PXD001774.sdrf.tsv
PXD002088/PXD002088.sdrf.tsv
PXD002192/PXD002192.sdrf.tsv
PXD002222/PXD002222.sdrf.tsv
PXD002255/PXD002255.sdrf.tsv
PXD002266/PXD002266.sdrf.tsv
PXD002370/PXD002370.sdrf.tsv
PXD002395/PXD002395.sdrf.tsv
PXD002756/PXD002756.sdrf.tsv
PXD002815/PXD002815.sdrf.tsv
PXD002870/PXD002870.sdrf.tsv
PXD003028/PXD003028.sdrf.tsv
PXD003133/PXD003133.sdrf.tsv
PXD003209/PXD003209.sdrf.tsv
PXD003531/PXD003531.sdrf.tsv
PXD003539/PXD003539.sdrf.tsv
PXD003668/PXD003668.sdrf.tsv
PXD004143/PXD004143.sdrf.tsv
PXD004242/PXD004242.sdrf.tsv
PXD004452/PXD004452-tissues.sdrf.tsv
PXD004603/PXD004603.sdrf.tsv
PXD004604/PXD004604.sdrf.tsv
PXD004613/PXD004613.sdrf.tsv
PXD004617/PXD004617.sdrf.tsv
PXD004618/PXD004618.sdrf.tsv
PXD004683/PXD004683.sdrf.tsv
PXD004705/PXD004705.sdrf.tsv
PXD004732/PXD004732.sdrf.tsv
PXD004903/PXD004903.sdrf.tsv
PXD004939/PXD004939.sdrf.tsv
PXD004987/PXD004987.sdrf.tsv
PXD005163/PXD005163.sdrf.tsv
PXD005171/PXD005171.sdrf.tsv
PXD005172/PXD005172.sdrf.tsv
PXD005174/PXD005174.sdrf.tsv
PXD005175/PXD005175.sdrf.tsv
PXD005176/PXD005176.sdrf.tsv
PXD005177/PXD005177.sdrf.tsv
PXD005207/PXD005207.sdrf.tsv
PXD005241/PXD005241.sdrf.tsv
PXD005355/PXD005355.sdrf.tsv
PXD005366/PXD005366-rattus.sdrf.tsv
PXD005445/PXD005445.sdrf.tsv
PXD005463/PXD005463.sdrf.tsv
PXD005554/PXD005554.sdrf.tsv
PXD005780/PXD005780.sdrf.tsv
PXD005819/PXD005819.sdrf.tsv
PXD006003/PXD006003.sdrf.tsv
PXD006132/PXD006132.sdrf.tsv
PXD006233/PXD006233.sdrf.tsv
PXD006401/PXD006401.sdrf.tsv
PXD006430/PXD006430-silac.sdrf.tsv
PXD006430/PXD006430-tmt.sdrf.tsv
PXD006675/PXD006675.sdrf.tsv
PXD006877/PXD006877.sdrf.tsv
PXD006914/PXD006914.sdrf.tsv
PXD007073/PXD007073.sdrf.tsv
PXD007160/PXD007160.sdrf.tsv
PXD007555/PXD007555.sdrf.tsv
PXD008222/PXD008222.sdrf.tsv
PXD008840/PXD008840.sdrf.tsv
PXD008841/PXD008841.sdrf.tsv
PXD009157/PXD009157.sdrf.tsv
PXD009203/PXD009203.sdrf.tsv
PXD009465/PXD009465.sdrf.tsv
PXD009602/PXD009602.sdrf.tsv
PXD009909/PXD009909.sdrf.tsv
PXD010154/PXD010154.sdrf.tsv
PXD010371/PXD010371.sdrf.tsv
PXD010429/PXD010429.sdrf.tsv
PXD010595/PXD010595.sdrf.tsv
PXD010708/PXD010708.sdrf.tsv
PXD011175/PXD011175.sdrf.tsv
PXD011799/PXD011799.sdrf.tsv
PXD011839/PXD011839.sdrf.tsv
PXD011967/PXD011967.sdrf.tsv
PXD012143/PXD012143.sdrf.tsv
PXD012243/PXD012243.sdrf.tsv
PXD012307/PXD012307.sdrf.tsv
PXD012593/PXD012593-rat.sdrf.tsv
PXD012593/PXD012593-srm.sdrf.tsv
PXD012755/PXD012755.sdrf.tsv
PXD012764/PXD012764.sdrf.tsv
PXD012986/PXD012986.sdrf.tsv
PXD013234/PXD013234.sdrf.tsv
PXD013523/PXD013523.sdrf.tsv
PXD013753/PXD013753.sdrf.tsv
PXD013868/PXD013868.sdrf.tsv
PXD013923/PXD013923.sdrf.tsv
PXD014145/PXD014145.sdrf.tsv
PXD014502/PXD014502.sdrf.tsv
PXD014525/PXD014525-dda.sdrf.tsv
PXD014525/PXD014525-dia.sdrf.tsv
PXD014528/PXD014528.sdrf.tsv
PXD014565/PXD014565.sdrf.tsv
PXD015093/PXD015093-LFQ.sdrf.tsv
PXD015093/PXD015093-TMT.sdrf.tsv
PXD015744/PXD015744.sdrf.tsv
PXD015833/PXD015833-Exp1.sdrf.tsv
PXD015833/PXD015833-Exp2.sdrf.tsv
PXD015833/PXD015833-Exp3.sdrf.tsv
PXD015833/PXD015833-Exp4.sdrf.tsv
PXD015833/PXD015833-Exp5.sdrf.tsv
PXD015833/PXD015833-Exp6.sdrf.tsv
PXD017035/PXD017035.sdrf.tsv
PXD017201/PXD017201.sdrf.tsv
PXD017291/PXD017291-mixed-label.sdrf.tsv
PXD017291/PXD017291-tmt.sdrf.tsv
PXD017602/PXD017602.sdrf.tsv
PXD017710/PXD017710-silac.sdrf.tsv
PXD017710/PXD017710-tmt.sdrf.tsv
PXD018357/PXD018357.sdrf.tsv
PXD018970/PXD018970.sdrf.tsv
PXD019113/PXD019113.sdrf.tsv
PXD019123/PXD019123.sdrf.tsv
PXD019185_PXD018883/PXD019185_PXD018883.sdrf.tsv
PXD019291/PXD019291.sdrf.tsv
PXD020187/PXD020187.sdrf.tsv
PXD020381/PXD020381.sdrf.tsv
PXD020394/PXD020394.sdrf.tsv
PXD023650/PXD023650.sdrf.tsv
PXD023707/PXD023707.sdrf.tsv
PXD026474/PXD026474.sdrf.tsv
PXD051889/PXD051889.sdrf.tsv

noatgnu avatar Sep 03 '25 11:09 noatgnu

I also think assay name MUST be unique because they represent a unique accession to the data file. Before, we didn't want to force it, but now, in this new and more stable format, we should make it unique. Also, we can make other columns unique, like the data file name. What do you think @noatgnu @TineClaeys @timosachsenberg @trishorts, @deeptijk @nithujohn @Copilot and others?

ypriverol avatar Sep 05 '25 09:09 ypriverol

For my work, I treat all values in this column as unique, so this matches my current workflow. Since source name can be non-unique, we need at least one column that can be treated as unique for indexing purposes, so that we don't have to concatenate columns to get a unique label.
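
A minimal sketch of what single-column indexing could look like, assuming a tab-separated SDRF with the standard `source name`, `assay name` and `comment[data file]` columns. The three-row fragment and the helper `index_by_column` are hypothetical and only illustrate the idea; they are not part of any existing tool.

```python
import csv
import io

# Hypothetical three-row SDRF fragment; real files have many more columns.
SDRF_TEXT = (
    "source name\tassay name\tcomment[data file]\n"
    "sample 1\trun 1\tfile_01.raw\n"
    "sample 1\trun 2\tfile_02.raw\n"
    "sample 2\trun 3\tfile_03.raw\n"
)


def index_by_column(sdrf_text, column):
    """Return {value: row} for `column`, raising ValueError on a repeated value."""
    index = {}
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    for row in reader:
        key = row[column]
        if key in index:
            raise ValueError(f"duplicate value in '{column}': {key!r}")
        index[key] = row
    return index


# Works because every assay name in the fragment is distinct.
by_assay = index_by_column(SDRF_TEXT, "assay name")
print(by_assay["run 2"]["comment[data file]"])
```

Calling `index_by_column(SDRF_TEXT, "source name")` on the same fragment raises `ValueError`, because `sample 1` appears twice; that is the situation that would otherwise force a concatenated key.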

noatgnu avatar Sep 05 '25 09:09 noatgnu

Source name also needs to be unique to the sample, no?

ypriverol avatar Sep 05 '25 09:09 ypriverol

Yes. I second making all values unique, including source name and assay name. If source names are not unique, we need to index multiple columns when comparing samples.

daichengxin avatar Sep 05 '25 10:09 daichengxin

As far as I remember, source names must be unique to the sample, while assay names should be unique to the data file.

The most obvious cases for non-unique values in both columns are:

  • TMT and the like for assay name (multiple samples linked to the same assay name);
  • technical replicates for source name (multiple runs for the same sample). Other cases here: pooled samples, duplicated TMT channels, etc.
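
The two cases can be illustrated with a small sketch. The rows below are hypothetical and, for brevity, combine a TMT run and a label-free sample in one list; `repeated` is an illustrative helper, not an existing API.

```python
from collections import Counter

# Hypothetical rows as (source name, assay name, label) tuples.
ROWS = [
    # One TMT run: three samples share a single assay name.
    ("sample 1", "run 1", "TMT126"),
    ("sample 2", "run 1", "TMT127N"),
    ("sample 3", "run 1", "TMT128N"),
    # Technical replicates: one label-free sample measured in two runs.
    ("sample 4", "run 2", "label free sample"),
    ("sample 4", "run 3", "label free sample"),
]


def repeated(values):
    """Return the values that occur more than once, sorted."""
    return sorted(v for v, n in Counter(values).items() if n > 1)


print(repeated(r[0] for r in ROWS))   # source names that repeat
print(repeated(r[1] for r in ROWS))   # assay names that repeat
print(repeated(r[:2] for r in ROWS))  # (source name, assay name) pairs that repeat
```

In this toy data each column repeats a value on its own (`sample 4` and `run 1`), while no (source name, assay name) pair repeats.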

levitsky avatar Sep 05 '25 11:09 levitsky

If you have multiple replicates, you will have the same source name but different assay names: assay name is unique to the data file. This is what we are proposing: the source name + assay name combination should be unique, so you will find each combination only once in a file.

ypriverol avatar Sep 05 '25 14:09 ypriverol

The source name + assay name combination should probably always be unique, as a sample can be present in an analysis only once. I just wanted to note that either one of these columns cannot be required to be unique on its own (as the title suggests).

levitsky avatar Sep 05 '25 14:09 levitsky

Fully addressed in #750

ypriverol avatar Sep 05 '25 15:09 ypriverol

Question I had after @ypriverol talked about this during the call: it is conceivable at least in theory that multiple channels are used in a TMT experiment for the same original sample, e.g. to directly measure technical variation. I have not found a lot of references, but there is this FragPipe issue where the user describes this experimental design. It is apparently more common to do that with pooled samples.

In this case both the source name and the assay name should be the same, and the combination will not be unique. So perhaps it should not be made completely impossible to annotate such a design?

levitsky avatar Sep 14 '25 11:09 levitsky

I think they are referring to pooled samples, and in that case you have multiple samples inside the same TMT channel. If samples are technical replicates, they should use a different file name; I doubt two technical replicates end up in the same file.

@levitsky can you think about the use case they are mentioning and bring it to the discussion at today's PSI-MS meeting?

ypriverol avatar Sep 19 '25 06:09 ypriverol

I agree that the assay name + source name combination has to be unique; that makes the most sense. Regarding the comment from @levitsky, would it make sense to add an additional check on label when there are non-unique instances of assay name and source name? In the case described, it is indeed logical to have these duplicated, but I'm wondering how often that occurs and whether we need to adapt to these rare cases or just ensure general support for the majority of cases.

TineClaeys avatar Sep 19 '25 08:09 TineClaeys

@levitsky @noatgnu @TineClaeys we can require the combination of source name, assay name and label to be unique. This adds the flexibility to annotate the same sample with different TMT channels in the same file (as in this extreme case). In addition, it will also help with the current situation. We have two options now:

1 - Most restrictive: source name + assay name: unique
2 - Relaxed but still better: source name + assay name + label: unique

If you like 1 please 👍, if you like 2 ❤️. If you don't like them, then just do: 👎

ypriverol avatar Sep 19 '25 13:09 ypriverol

I think a non-unique source name + assay name combination should produce a warning or even an error, but there should be a way to validate an SDRF for such a case (if an error is raised, there should be a way around it).

levitsky avatar Sep 19 '25 13:09 levitsky

@levitsky I think we are just defining the rule in the specification, and then the validator responds to it. Based on your feedback, it could be:

1 - source name + assay name: unique RECOMMENDED (Warning if not)
2 - source name + assay name + label: unique (Error if not)

ypriverol avatar Sep 19 '25 13:09 ypriverol

This is the final conclusion and should be implemented in #733 and the new sdrf_pipelines validator https://github.com/bigbio/sdrf-pipelines/pull/219

@levitsky I think we are just defining the rule in the specification, and then the validator responds to it. Based on your feedback, it could be:

1 - source name + assay name: unique RECOMMENDED (Warning if not)
2 - source name + assay name + label: unique (Error if not)
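
The two rules could be checked roughly as follows. This is only a sketch under stated assumptions, not the sdrf-pipelines implementation: the function `check_uniqueness`, the message wording and the example rows are hypothetical, and the label is assumed to live in the `comment[label]` column.

```python
from collections import Counter


def check_uniqueness(rows):
    """Apply both rules to `rows` (a list of dicts keyed by SDRF column name).

    Returns (warnings, errors) as two lists of messages.
    """
    pairs = Counter((r["source name"], r["assay name"]) for r in rows)
    triples = Counter(
        (r["source name"], r["assay name"], r["comment[label]"]) for r in rows
    )
    # Rule 1: source name + assay name should be unique (warning if not).
    warnings = [
        f"source name + assay name repeated {n} times: {key}"
        for key, n in pairs.items()
        if n > 1
    ]
    # Rule 2: source name + assay name + label must be unique (error if not).
    errors = [
        f"source name + assay name + label repeated {n} times: {key}"
        for key, n in triples.items()
        if n > 1
    ]
    return warnings, errors


# Hypothetical design: one pooled sample measured in two TMT channels of one run.
rows = [
    {"source name": "pool", "assay name": "run 1", "comment[label]": "TMT126"},
    {"source name": "pool", "assay name": "run 1", "comment[label]": "TMT127N"},
    {"source name": "sample 2", "assay name": "run 1", "comment[label]": "TMT128N"},
]
warnings, errors = check_uniqueness(rows)
print(len(warnings), len(errors))
```

With these rows the duplicated-channel design triggers a warning under rule 1 but no error under rule 2; appending an exact copy of the first row would additionally trigger an error.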

ypriverol avatar Sep 22 '25 05:09 ypriverol