isa-api
isa-api copied to clipboard
Test Data Validation Changes
A few of the PRs I still have open and am working on have all ran into this issue, and I need some input from you guys to resolve it.
In tests/validators/test_validate_test_data.py there are tests like:
def test_validate_testdata_bii_s_3_isatab(self):
test_case = 'BII-S-3'
with open(os.path.join(utils.TAB_DATA_DIR, test_case, 'i_gilbert.txt')) as test_case_fp:
report = isatab.validate(fp=test_case_fp, config_dir=utils.DEFAULT2015_XML_CONFIGS_DATA_DIR,
log_level=self._reporting_level)
if len(report['errors']) > 0:
self.fail("Error found when validating ISA tab: {}".format(report['errors']))
In #551 though many of them changed to:
def test_validate_testdata_bii_s_3_isatab(self):
test_case = 'BII-S-3'
with open(os.path.join(utils.TAB_DATA_DIR, test_case, 'i_gilbert.txt')) as test_case_fp:
report = isatab.validate(fp=test_case_fp, config_dir=utils.DEFAULT2015_XML_CONFIGS_DATA_DIR,
log_level=self._reporting_level)
if len(report['errors']) > 0:
# self.fail("Error found when validating ISA tab: {}".format(report['errors']))
self.assertTrue("Error found when validating ISA tab: {}".format(report['errors']))
I don't think this was a wise change. The tests changed from testing that the data had no errors, to testing that it does have errors, and doesn't specify the errors. I think a better way to handle this is either to fix the errors in that data or explicitly look for the errors in that data, so you can know if changes are adding or subtracting errors in the future. The reason this is causing me issues for my PRs is because I am trying to fix some of the validations so they are inline with other parts of the code and examples, but the code and examples don't align with each other.
If we could hash out a few things about what is correct I can change the validation and test data to be correct and restore these tests to how they used to be.
Here is an example, tests/data/tab/BII-I-1/a_proteome.txt:
Sample Name | Protocol REF | Extract Name | Protocol REF | Labeled Extract Name | Label | Term Source REF | Term Accession Number | MS Assay Name | Comment[PRIDE Accession] | Comment[PRIDE Processed Data Accession] | Raw Spectral Data File | Normalization Name | Protein Assignment File | Peptide Assignment File | Post Translational Modification Assignment File | Data Transformation Name | Derived Spectral Data File | Factor Value[limiting nutrient] | Term Source REF | Term Accession Number | Factor Value[rate] | Unit | Term Source REF | Term Accession Number |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
S-0.1-aliquot11 | protein extraction | S-0.1 | ITRAQ labeling | JC_S-0.1 | iTRAQ reagent 117 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml | sulphur | 0.1 | l/hr | ||||||
C-0.1-aliquot11 | protein extraction | C-0.1 | ITRAQ labeling | JC_C-0.1 | iTRAQ reagent 116 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml | carbon | 0.1 | l/hr | ||||||
N-0.1-aliquot11 | protein extraction | N-0.1 | ITRAQ labeling | JC_N-0.1 | iTRAQ reagent 115 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml | nitrogen | 0.1 | l/hr | ||||||
S-0.1-aliquot11 | protein extraction | S-0.1 | ITRAQ labeling | Pool1 | iTRAQ reagent 114 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml | l/hr | ||||||||
C-0.1-aliquot11 | protein extraction | C-0.1 | ITRAQ labeling | Pool1 | iTRAQ reagent 114 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml | l/hr | ||||||||
N-0.1-aliquot11 | protein extraction | N-0.1 | ITRAQ labeling | Pool1 | iTRAQ reagent 114 | 8761 | 8761 | 8761 | spectrum.mzdata | norm1 | proteins.csv | peptides.csv | ptms.csv | datatransformation1 | PRIDE_Exp_Complete_Ac_8761.xml | l/hr | ||||||||
C-0.2-aliquot11 | protein extraction | C-0.2 | ITRAQ labeling | JC_C-0.2 | iTRAQ reagent 117 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml | carbon | 0.2 | l/hr | ||||||
N-0.2-aliquot11 | protein extraction | N-0.2 | ITRAQ labeling | JC_N-0.2 | iTRAQ reagent 116 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml | nitrogen | 0.2 | l/hr | ||||||
P-0.1-aliquot11 | protein extraction | P-0.1 | ITRAQ labeling | JC_P-0.1 | iTRAQ reagent 115 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml | phosphorus | 0.1 | l/hr | ||||||
C-0.2-aliquot11 | protein extraction | C-0.2 | ITRAQ labeling | Pool2 | iTRAQ reagent 114 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml | l/hr | ||||||||
N-0.2-aliquot11 | protein extraction | N-0.2 | ITRAQ labeling | Pool2 | iTRAQ reagent 114 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml | l/hr | ||||||||
P-0.1-aliquot11 | protein extraction | P-0.1 | ITRAQ labeling | Pool2 | iTRAQ reagent 114 | 8762 | 8762 | 8762 | spectrum.mzdata | norm2 | proteins.csv | peptides.csv | ptms.csv | datatransformation2 | PRIDE_Exp_Complete_Ac_8762.xml | l/hr | ||||||||
P-0.2-aliquot11 | protein extraction | P-0.2 | ITRAQ labeling | JC_P-0.2 | iTRAQ reagent 116 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml | phosphorus | 0.2 | l/hr | ||||||
S-0.2-aliquot11 | protein extraction | S-0.2 | ITRAQ labeling | JC_S-0.2 | iTRAQ reagent 115 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml | sulphur | 0.2 | l/hr | ||||||
P-0.2-aliquot11 | protein extraction | P-0.2 | ITRAQ labeling | Pool3 | iTRAQ reagent 117 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml | l/hr | ||||||||
S-0.2-aliquot11 | protein extraction | S-0.2 | ITRAQ labeling | Pool3 | iTRAQ reagent 117 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml | l/hr | ||||||||
P-0.2-aliquot11 | protein extraction | P-0.2 | ITRAQ labeling | Pool3 | iTRAQ reagent 114 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml | l/hr | ||||||||
S-0.2-aliquot11 | protein extraction | S-0.2 | ITRAQ labeling | Pool3 | iTRAQ reagent 114 | 8763 | 8763 | 8763 | spectrum.mzdata | norm3 | proteins.csv | peptides.csv | ptms.csv | datatransformation3 | PRIDE_Exp_Complete_Ac_8763.xml | l/hr |
This has a few issues:
- "Protocol REF" columns appear to be missing in 3 places, before each of the "Name" columns. The
preprocess
function in ProcessSequenceFactory.py warns about this. - The "Factor Value" columns are at the end instead of being after "Sample Name".
- There are completely empty "Term Source REF" and "Term Accession Number" columns.
Questions about these issues:
- Although missing "Protocol REF" columns will not cause the
ProcessSequenceFactory
to error, it adds the columns with "unknown" for the value, I think this should still be at least a warning in validation. I don't fully understand what is going on here. Was there a time where the "Name" columns was all that was necessary and you didn't need to specify the protocol? It would be nice to have some consensus about this and make the code and examples align with that. - The
ProcessSequenceFactory
handles "Factor Values" in a unique way. It splits columns into object groups and then loops over the object groups to parse them into model objects. For things like "Characteristics" and "Parameters" it only looks for them within the object group, but for "Factor Values" it will look for them within the entire data frame. This behavior assumes or suggests that there can only be 1 "Sample Name" column since it adds all factors in the entire file every time it finds a "Sample Name" column. This is correct for assays, but can studies only have 1 "Sample Name" column? ThisProcessSequenceFactory
also suggests that "Factor Value" columns can be anywhere, but this makes validation more difficult. I would recommend requiring "Factor Values" to be after the "Sample Name" column just like "Characteristics" to make things more consistent. - Completely empty "Term Source REF" and "Term Accession Number" columns doesn't cause any errors, but it kind of suggests that at one time they were required maybe? If we end up editing the data I would like to know whether we could just remove these or not.
I have also come across some other issues while investigating this that aren't illustrated in the example.
- Can there be multiple "Name" columns, such as "Assay Name"? If so then they need to be matched a little differently in some of the code. Currently, they are matched using
==
instead ofstartswith
like "Protocol REF" columns are. - Is order supposed to matter for "Term Source REF" and "Term Accession Number"? As well as "Unit", "Term Source REF", "Term Accession Number"? They do in the code,
get_value
used byProcessSequenceFactory
requires a specific order, but is it supposed to be that way? Also all of the columns must be present, you can't just have a "Unit" column for instance. I have written some preliminary code that changes both of these conditions. It would make it so if just "Unit" is present it still gets added and if they are in any order it still gets created.
If it is easier to meet and discuss this I am available.
TODO: change the test from: self.assertTrue("Error found when validating ISA tab: {}".format(report['errors'])) to: self.fail("Error found when validating ISA tab: {}".format(report['errors']))
and fix the test files in response to errors detected.