subcellular_localization
DeepLoc vs. current SwissProt presents several inconsistencies
Hi,
I'm trying to replicate your dataset (as can be downloaded from here: http://www.cbs.dtu.dk/services/DeepLoc/data.php) but using current SwissProt instead.
- I download the most recent SwissProt version
- I filter by ECO:0000269 -- experimental evidence used in manual assertion -- EXP
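For illustration, the evidence filter could be sketched as below. This is a simplified sketch, assuming the subcellular location comment has already been extracted from a SwissProt flat-file entry as one string; `experimental_locations` is a hypothetical helper, the PubMed and UniProtKB identifiers in the example are placeholders, and isoform prefixes or topology terms are not handled specially.

```python
import re

# Matches one "Location {evidence}" statement inside a SUBCELLULAR LOCATION comment.
_STATEMENT = re.compile(r"([^.;{}]+?)\s*\{([^}]*)\}")


def experimental_locations(comment, eco="ECO:0000269"):
    """Return location terms whose evidence block contains the given ECO code."""
    text = comment.split("SUBCELLULAR LOCATION:", 1)[-1]
    # Drop the free-text note, which carries its own evidence tags.
    text = text.split("Note=", 1)[0]
    found = []
    for term, evidence in _STATEMENT.findall(text):
        if eco in evidence:
            found.append(term.strip(" ,"))
    return found


# Placeholder identifiers, only to show the expected input shape.
example = (
    "SUBCELLULAR LOCATION: Cytoplasm {ECO:0000269|PubMed:0000000}. "
    "Nucleus {ECO:0000250|UniProtKB:X00000}."
)
print(experimental_locations(example))
```

Only "Cytoplasm" survives in this example, because the "Nucleus" statement carries a different evidence code.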
Now the issues start:
There is no convenient way of mapping to your "locations" from "sublocations" as in Table 1 of https://academic.oup.com/bioinformatics/article/33/21/3387/3931857
In addition to this table, a CSV (or Excel) file would have been nice. Something like:
DeepLoc | SwissProt | SwissProt Ontology |
---|---|---|
Cell.membrane | Apical cell membrane | SL-0015 |
Cell.membrane | Apicolateral cell membrane | SL-0017 |
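With such a file, the mapping step would be a plain lookup. A minimal sketch, assuming a CSV with the hypothetical column names `deeploc`, `swissprot` and `swissprot_ontology`, and containing only the two example rows from the table above:

```python
import csv
import io

# Only the two example rows from the table above; a real file would need
# one row per SwissProt sublocation.
MAPPING_CSV = """deeploc,swissprot,swissprot_ontology
Cell.membrane,Apical cell membrane,SL-0015
Cell.membrane,Apicolateral cell membrane,SL-0017
"""


def load_mapping(handle):
    """Build {SwissProt sublocation -> DeepLoc location} from a CSV handle."""
    return {row["swissprot"]: row["deeploc"] for row in csv.DictReader(handle)}


def to_deeploc(sublocations, mapping):
    """Map sublocations to DeepLoc labels; collect terms with no mapping."""
    mapped, unmapped = set(), set()
    for term in sublocations:
        if term in mapping:
            mapped.add(mapping[term])
        else:
            unmapped.add(term)
    return mapped, unmapped


mapping = load_mapping(io.StringIO(MAPPING_CSV))
print(to_deeploc(["Apical cell membrane", "Glycosome"], mapping))
```

Returning the unmapped terms separately makes it easy to see which SwissProt sublocations the table does not cover.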
I instead tried 2 things:
Merge SwissProt and DeepLoc annotations by means of the accession number:
Since there will be some proteins with multiple swissprot and/or multiple deeploc annotations, this will result in something like:
swissprot | deeploc_0 | deeploc_1 | deeploc_2 | deeploc_3 | deeploc_4 | deeploc_5 | deeploc_6 | deeploc_7 | deeploc_8 |
---|---|---|---|---|---|---|---|---|---|
Cytoplasm | Cytoplasm | Nucleus | Peroxisome | Cell.membrane | Mitochondrion | Extracellular | Endoplasmic.reticulum | Lysosome/Vacuole | Golgi.apparatus |
where I then have to manually select which "deeploc_X" is the correct mapping from SwissProt. Unfortunately, this procedure highlighted some inconsistencies, for example: SwissProt localizations never mentioned in Table 1, but associated with one or more DeepLoc localizations. An excerpt of things that didn't quite look right:
SwissProt | DeepLoc(s) |
---|---|
Cleavage furrow | Extracellular |
Cytoplasmic granule lumen | Extracellular |
Glycosome | Peroxisome |
Sarcoplasmic reticulum lumen | Extracellular |
Recycling endosome | Cell.membrane, Cytoplasm |
Recycling endosome membrane | Cell.membrane |
Cell surface | Extracellular |
Cytoplasmic granule | Plastid, Nucleus, Cytoplasm |
Cytoplasmic granule lumen | Extracellular |
While mappings like "Glycosome" to "Peroxisome" are not a big deal, this one was never mentioned in Table 1. It could derive from the version difference between SwissProt today vs. 2016, but it is worth mentioning. Other localizations seem far-fetched (e.g. "Cytoplasmic granule lumen" == "Extracellular").
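The accession-based merge can be sketched roughly as follows. This is not the exact code from my procedure: `cooccurring_labels` is a hypothetical helper, the accessions are made up, and both inputs are assumed to map an accession to its set of location labels.

```python
from collections import defaultdict


def cooccurring_labels(swissprot, deeploc):
    """For each SwissProt term, collect every DeepLoc label seen on the same protein.

    Both arguments map accession -> set of location labels.
    """
    pairs = defaultdict(set)
    for accession in swissprot.keys() & deeploc.keys():
        for term in swissprot[accession]:
            pairs[term] |= deeploc[accession]
    return dict(pairs)


# Toy records with made-up accessions, only to show the shape of the result.
swissprot = {"ACC1": {"Cytoplasm", "Glycosome"}, "ACC2": {"Cell surface"}}
deeploc = {
    "ACC1": {"Cytoplasm", "Peroxisome"},
    "ACC2": {"Extracellular"},
    "ACC3": {"Nucleus"},
}
for term, labels in sorted(cooccurring_labels(swissprot, deeploc).items()):
    print(term, "->", sorted(labels))
```

Because a protein with several annotations pairs every SwissProt term with every DeepLoc label, the result only shows co-occurrence, which is why a manual selection step is still needed.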
Filter SwissProt and DeepLoc for proteins with single localizations (each set separately), then merge by means of accession numbers
The idea behind this was to have a single, unequivocal mapping from SwissProt to DeepLoc. Unfortunately, this highlighted some other inconsistencies:
SwissProt | DeepLoc(s) |
---|---|
Cytoplasm | Cytoplasm, Nucleus, Peroxisome |
Endoplasmic reticulum | Endoplasmic.reticulum, Nucleus |
Mitochondrion | Mitochondrion, Extracellular |
Nucleus | Nucleus, Cytoplasm |
Peroxisome | Peroxisome, Cytoplasm |
Plastid | Plastid, Cytoplasm, Endoplasmic.reticulum |
In this case, there shouldn't be more than one mapping. What this suggests is that there are proteins marked as "Plastid" in SwissProt, but marked as "Plastid", "Cytoplasm" or "Endoplasmic.reticulum" in DeepLoc. While this might be a natural consequence of better curation in SwissProt, it highlights that the DeepLoc set on the webpage is no longer up to date. In the absence of a clear, unequivocal mapping from SwissProt location names to DeepLoc, it's virtually impossible to build a new "DeepLoc" training set.
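The single-localization check can be sketched like this. Again a rough illustration rather than the exact procedure: `single_location` and `conflicting_terms` are hypothetical helpers, the accessions are made up, and both inputs are assumed to map an accession to its set of location labels.

```python
from collections import defaultdict


def single_location(annotations):
    """Keep only proteins annotated with exactly one location."""
    return {
        acc: next(iter(locs))
        for acc, locs in annotations.items()
        if len(locs) == 1
    }


def conflicting_terms(swissprot, deeploc):
    """Return SwissProt terms whose single-location proteins carry >1 DeepLoc label."""
    sp, dl = single_location(swissprot), single_location(deeploc)
    seen = defaultdict(set)
    for accession in sp.keys() & dl.keys():
        seen[sp[accession]].add(dl[accession])
    return {term: labels for term, labels in seen.items() if len(labels) > 1}


# Toy records with made-up accessions.
swissprot = {
    "ACC1": {"Plastid"},
    "ACC2": {"Plastid"},
    "ACC3": {"Nucleus"},
    "ACC4": {"Nucleus", "Cytoplasm"},  # dropped: more than one location
}
deeploc = {
    "ACC1": {"Plastid"},
    "ACC2": {"Cytoplasm"},
    "ACC3": {"Nucleus"},
    "ACC4": {"Nucleus"},
}
print(conflicting_terms(swissprot, deeploc))
```

Any term returned here is one where the two datasets disagree for at least one protein, since each side contributes exactly one label per protein.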
EDIT: The procedure so far is detailed here: https://github.com/sacdallago/deeploc_redo
Hi Christian,
Yes, you are right that, due to the evolution of the annotations in UniProt, what had one annotation in 2016 might have a different one now. For example, more experimental annotations might have been added to a protein, leading to more than one experimental localization for what was previously a single-localization protein.
What do you want to achieve exactly? The same DeepLoc training set with up-to-date annotations?
Hi @JJAlmagro , thanks for getting back to me so quickly :) Hope you are doing well!
> What do you want to achieve exactly? The same DeepLoc training set with up-to-date annotations?
Yes. The goal would be to re-create the DeepLoc-type training (and testing) set from current SwissProt. I just realized from the statistics I got over the weekend that, while the distributions look similar to what you have in the paper, I get much higher numbers, probably because I don't know how you programmatically removed incomplete sequences (and because the set is not yet redundancy reduced!). It would be great if you had any scripts lying around that you used for filtering SwissProt before redundancy reduction, or before splitting into train/test :)
FYI my current numbers (again: no sequence length filter; no incomplete sequence filter; no redundancy reduction; all of SwissProt, but with the mapping defined as MANUAL_MAP here):
Nucleus,10653
Cytoplasm,10335
Extracellular,6725
Cell.membrane,6119
Mitochondrion,2740
Endoplasmic.reticulum,2185
Lysosome/Vacuole,1482
Golgi.apparatus,1328
Plastid,1197
Peroxisome,300