isa-api Test Data Validation Changes

Test Data Validation Changes

Open ptth222 opened this issue 1 year ago • 1 comments

A few of the PRs I still have open and am working on have all ran into this issue, and I need some input from you guys to resolve it.

In tests/validators/test_validate_test_data.py there are tests like:

def test_validate_testdata_bii_s_3_isatab(self):
        test_case = 'BII-S-3'
        with open(os.path.join(utils.TAB_DATA_DIR, test_case, 'i_gilbert.txt')) as test_case_fp:
            report = isatab.validate(fp=test_case_fp, config_dir=utils.DEFAULT2015_XML_CONFIGS_DATA_DIR,
                                     log_level=self._reporting_level)
            if len(report['errors']) > 0:
                self.fail("Error found when validating ISA tab: {}".format(report['errors']))

In #551 though many of them changed to:

def test_validate_testdata_bii_s_3_isatab(self):
        test_case = 'BII-S-3'
        with open(os.path.join(utils.TAB_DATA_DIR, test_case, 'i_gilbert.txt')) as test_case_fp:
            report = isatab.validate(fp=test_case_fp, config_dir=utils.DEFAULT2015_XML_CONFIGS_DATA_DIR,
                                     log_level=self._reporting_level)
            if len(report['errors']) > 0:
                # self.fail("Error found when validating ISA tab: {}".format(report['errors']))
                self.assertTrue("Error found when validating ISA tab: {}".format(report['errors']))

I don't think this was a wise change. The tests changed from testing that the data had no errors, to testing that it does have errors, and doesn't specify the errors. I think a better way to handle this is either to fix the errors in that data or explicitly look for the errors in that data, so you can know if changes are adding or subtracting errors in the future. The reason this is causing me issues for my PRs is because I am trying to fix some of the validations so they are inline with other parts of the code and examples, but the code and examples don't align with each other.

If we could hash out a few things about what is correct I can change the validation and test data to be correct and restore these tests to how they used to be.

Here is an example, tests/data/tab/BII-I-1/a_proteome.txt:

Sample Name	Protocol REF	Extract Name	Protocol REF	Labeled Extract Name	Label	Term Accession Number	MS Assay Name	Comment[PRIDE Accession]	Comment[PRIDE Processed Data Accession]	Raw Spectral Data File	Normalization Name	Protein Assignment File	Peptide Assignment File	Post Translational Modification Assignment File	Data Transformation Name	Derived Spectral Data File	Factor Value[limiting nutrient]	Term Source REF	Term Accession Number	Factor Value[rate]
S-0.1-aliquot11	protein extraction	S-0.1	ITRAQ labeling	JC_S-0.1	iTRAQ reagent 117	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml	sulphur			0.1	l/hr
C-0.1-aliquot11	protein extraction	C-0.1	ITRAQ labeling	JC_C-0.1	iTRAQ reagent 116	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml	carbon			0.1	l/hr
N-0.1-aliquot11	protein extraction	N-0.1	ITRAQ labeling	JC_N-0.1	iTRAQ reagent 115	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml	nitrogen			0.1	l/hr
S-0.1-aliquot11	protein extraction	S-0.1	ITRAQ labeling	Pool1	iTRAQ reagent 114	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml		l/hr
C-0.1-aliquot11	protein extraction	C-0.1	ITRAQ labeling	Pool1	iTRAQ reagent 114	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml		l/hr
N-0.1-aliquot11	protein extraction	N-0.1	ITRAQ labeling	Pool1	iTRAQ reagent 114	8761	8761	8761	spectrum.mzdata	norm1	proteins.csv	peptides.csv	ptms.csv	datatransformation1	PRIDE_Exp_Complete_Ac_8761.xml		l/hr
C-0.2-aliquot11	protein extraction	C-0.2	ITRAQ labeling	JC_C-0.2	iTRAQ reagent 117	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml	carbon			0.2	l/hr
N-0.2-aliquot11	protein extraction	N-0.2	ITRAQ labeling	JC_N-0.2	iTRAQ reagent 116	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml	nitrogen			0.2	l/hr
P-0.1-aliquot11	protein extraction	P-0.1	ITRAQ labeling	JC_P-0.1	iTRAQ reagent 115	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml	phosphorus		0.1	l/hr
C-0.2-aliquot11	protein extraction	C-0.2	ITRAQ labeling	Pool2	iTRAQ reagent 114	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml		l/hr
N-0.2-aliquot11	protein extraction	N-0.2	ITRAQ labeling	Pool2	iTRAQ reagent 114	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml		l/hr
P-0.1-aliquot11	protein extraction	P-0.1	ITRAQ labeling	Pool2	iTRAQ reagent 114	8762	8762	8762	spectrum.mzdata	norm2	proteins.csv	peptides.csv	ptms.csv	datatransformation2	PRIDE_Exp_Complete_Ac_8762.xml		l/hr
P-0.2-aliquot11	protein extraction	P-0.2	ITRAQ labeling	JC_P-0.2	iTRAQ reagent 116	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml	phosphorus		0.2	l/hr
S-0.2-aliquot11	protein extraction	S-0.2	ITRAQ labeling	JC_S-0.2	iTRAQ reagent 115	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml	sulphur			0.2	l/hr
P-0.2-aliquot11	protein extraction	P-0.2	ITRAQ labeling	Pool3	iTRAQ reagent 117	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml		l/hr
S-0.2-aliquot11	protein extraction	S-0.2	ITRAQ labeling	Pool3	iTRAQ reagent 117	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml		l/hr
P-0.2-aliquot11	protein extraction	P-0.2	ITRAQ labeling	Pool3	iTRAQ reagent 114	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml		l/hr
S-0.2-aliquot11	protein extraction	S-0.2	ITRAQ labeling	Pool3	iTRAQ reagent 114	8763	8763	8763	spectrum.mzdata	norm3	proteins.csv	peptides.csv	ptms.csv	datatransformation3	PRIDE_Exp_Complete_Ac_8763.xml		l/hr

This has a few issues:

"Protocol REF" columns appear to be missing in 3 places, before each of the "Name" columns. The preprocess function in ProcessSequenceFactory.py warns about this.
The "Factor Value" columns are at the end instead of being after "Sample Name".
There are completely empty "Term Source REF" and "Term Accession Number" columns.

Questions about these issues:

Although missing "Protocol REF" columns will not cause the ProcessSequenceFactory to error, it adds the columns with "unknown" for the value, I think this should still be at least a warning in validation. I don't fully understand what is going on here. Was there a time where the "Name" columns was all that was necessary and you didn't need to specify the protocol? It would be nice to have some consensus about this and make the code and examples align with that.
The ProcessSequenceFactory handles "Factor Values" in a unique way. It splits columns into object groups and then loops over the object groups to parse them into model objects. For things like "Characteristics" and "Parameters" it only looks for them within the object group, but for "Factor Values" it will look for them within the entire data frame. This behavior assumes or suggests that there can only be 1 "Sample Name" column since it adds all factors in the entire file every time it finds a "Sample Name" column. This is correct for assays, but can studies only have 1 "Sample Name" column? This ProcessSequenceFactory also suggests that "Factor Value" columns can be anywhere, but this makes validation more difficult. I would recommend requiring "Factor Values" to be after the "Sample Name" column just like "Characteristics" to make things more consistent.
Completely empty "Term Source REF" and "Term Accession Number" columns doesn't cause any errors, but it kind of suggests that at one time they were required maybe? If we end up editing the data I would like to know whether we could just remove these or not.

I have also come across some other issues while investigating this that aren't illustrated in the example.

Can there be multiple "Name" columns, such as "Assay Name"? If so then they need to be matched a little differently in some of the code. Currently, they are matched using == instead of startswith like "Protocol REF" columns are.
Is order supposed to matter for "Term Source REF" and "Term Accession Number"? As well as "Unit", "Term Source REF", "Term Accession Number"? They do in the code, get_value used by ProcessSequenceFactory requires a specific order, but is it supposed to be that way? Also all of the columns must be present, you can't just have a "Unit" column for instance. I have written some preliminary code that changes both of these conditions. It would make it so if just "Unit" is present it still gets added and if they are in any order it still gets created.

If it is easier to meet and discuss this I am available.

Mar 29 '24 00:03 ptth222

TODO: change the test from: self.assertTrue("Error found when validating ISA tab: {}".format(report['errors'])) to: self.fail("Error found when validating ISA tab: {}".format(report['errors']))

and fix the test files in response to errors detected.

Apr 19 '24 14:04 proccaserra

isa-api isa-api copied to clipboard

Test Data Validation Changes

isa-api
isa-api copied to clipboard