cell-ontology
Map all PRO terms used in CL to UniProt (where possible).
We need to be able to map PRO terms used by CL to something the rest of the world can use. I think that means UniProt. Xrefs to UniProt are rare:
https://api.triplydb.com/s/tuAThwx4i
We mostly have xrefs to
- PIR - which often has the mappings we need, but AFAIK has no API - so we'd need to scrape?
- IUPHAR - need to research how we might use this.
Where we can't map based on ID, I think we may need to resort to lexical mapping. One option for this is GILDA.
@addiehl - any other suggestions based on your prior work on these + other linked resources?
@cmungall - any suggestions for strategy?
It might be useful to ask Darren @nataled
I'll overlook the "something the rest of the world can use" comment ;)
The results of that SPARQL query fall into two types:
- The xref points to a protein family. These are cases where the PRO term was created on the basis of the indicated xref at the time the term was created. Prefixes include:
  - PIRSF: https://proteininformationresource.org/cgi-bin/ipcSF?id=
  - PANTHER: http://www.pantherdb.org/panther/family.do?clsAccession=
  - IUPHARfam: http://www.guidetopharmacology.org/GRAC/FamilyDisplayForward?familyId=
  - IUPHARobj: http://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=
- The xref points to a specific protein or proteoform. For all these, the DTO and Reactome xrefs are superfluous in that they also have a UniProtKB xref. Prefixes include:
  - UniProtKB: http://purl.uniprot.org/uniprot/
  - DTO: http://www.drugtargetontology.org/dto/DTO_
  - Reactome: http://www.reactome.org/content/detail/
For the first set, no single UniProtKB mapping is appropriate. Are you trying to obtain all the possible UniProtKB entries pertinent to those xrefs?
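The two groups of prefixes listed above can be written down as a small lookup. This is only a sketch: the URL bases are copied from the comment, while the function name `classify_xref` and the "family"/"protein" labels are made up for illustration.

```python
# URL bases copied from the prefix lists in the comment above.
FAMILY_PREFIXES = {
    "PIRSF": "https://proteininformationresource.org/cgi-bin/ipcSF?id=",
    "PANTHER": "http://www.pantherdb.org/panther/family.do?clsAccession=",
    "IUPHARfam": "http://www.guidetopharmacology.org/GRAC/FamilyDisplayForward?familyId=",
    "IUPHARobj": "http://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=",
}
PROTEIN_PREFIXES = {
    "UniProtKB": "http://purl.uniprot.org/uniprot/",
    "DTO": "http://www.drugtargetontology.org/dto/DTO_",
    "Reactome": "http://www.reactome.org/content/detail/",
}


def classify_xref(xref):
    """Return (kind, url) for a CURIE-style xref such as 'PIRSF:PIRSF016630'.

    kind is 'family', 'protein', or 'unknown' (hypothetical labels).
    """
    prefix, _, local_id = xref.partition(":")
    if prefix in FAMILY_PREFIXES:
        return "family", FAMILY_PREFIXES[prefix] + local_id
    if prefix in PROTEIN_PREFIXES:
        return "protein", PROTEIN_PREFIXES[prefix] + local_id
    return "unknown", None


if __name__ == "__main__":
    for example in ("PIRSF:PIRSF016630", "IUPHARobj:2764", "UniProtKB:P35329"):
        print(example, classify_xref(example))
```

Only xrefs in the "protein" group can be used as a UniProt mapping directly; the "family" group needs one of the other strategies discussed in this thread.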
@nataled - many thanks for the details.
Various uses. In general including IDs that bioinformaticians are familiar with opens up more possibilities for them to use markers recorded in CL in their analyses.
More specifically, we're working on a Cell Type knowledge base with a focus on cell markers in human and mouse. We have other sources of known and potential markers - curated and computed. I'd like to find some way to fold in curated cell surface markers from CL.
It looks to me like in most cases 'family' here means a general term for the gene across species.
i | pro_label | PRO ID | xref |
---|---|---|---|
1 | CD19 molecule | obo:PR_000001002 | IUPHARobj:2764 |
2 | CD19 molecule | obo:PR_000001002 | PIRSF:PIRSF016630 |
It also looks like we could pull the mouse and human UniProt IDs from the PIR pages: https://proteininformationresource.org/cgi-bin/ipcSF?id=PIRSF016630. Is there an API option? If not, we will scrape. This will work for our KB plans. I think it would also be useful to include these IDs in CL under some annotation property.
Seems we can use the structure of PRO to extract many of these, e.g.
https://api.triplydb.com/s/WGSZidIVe
PRO - CL Marker | Mouse specific subclass | mouse xref |
---|---|---|
ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 | ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (mouse) | UniProtKB:P56528 | |
B-cell lymphoma 6 protein | B-cell lymphoma 6 protein homolog (mouse) | UniProtKB:P41183 | |
B-cell receptor CD22 | B-cell receptor CD22 (mouse) | UniProtKB:P35329 | |
C-C chemokine receptor type 1 | C-C chemokine receptor type 1 (mouse) | UniProtKB:P51675 | |
C-C chemokine receptor type 2 | C-C chemokine receptor type 2 (mouse) | UniProtKB:P51683 |
The subclasses are not (currently) in the import, and even if they were, we should still find some way to better support bioinformatician users. From looking at the numbers, this won't work in every case, but it is a good start.
Suggested mechanism to extract:
For all PRO terms used as markers for CL terms:
- Look for uniprot xref
- If no uniprot xref: Find immediate subclasses for mouse and human & extract uniprot refs. The assumption is that direct subclasses will link to the record for the protein in general ("representative isoform"?) rather than to specific isoforms.
- ... some other strategy for remaining terms.
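The first two steps of the suggested mechanism could look roughly like this. It is a sketch over in-memory dictionaries rather than a real query against PRO: the function names, the `strategy` labels, and the PR identifiers in the example data are all hypothetical placeholders. Only the CD22 labels and the `UniProtKB:P35329` xref come from the table above.

```python
def uniprot_xrefs(xrefs):
    """Keep only the xrefs that use the UniProtKB prefix."""
    return sorted(x for x in xrefs if x.startswith("UniProtKB:"))


def map_marker(term_id, xrefs_by_term, species_subclasses):
    """Map one PRO marker term to UniProt using the two-step fallback.

    xrefs_by_term:      {term_id: [xref, ...]}
    species_subclasses: {term_id: {species: direct_subclass_id}}
    """
    # Step 1: look for a UniProt xref on the term itself.
    direct = uniprot_xrefs(xrefs_by_term.get(term_id, []))
    if direct:
        return {"strategy": "direct_xref", "uniprot": direct}

    # Step 2: fall back to the immediate mouse/human subclasses.
    inherited = {}
    for species, child_id in species_subclasses.get(term_id, {}).items():
        found = uniprot_xrefs(xrefs_by_term.get(child_id, []))
        if found:
            inherited[species] = found
    if inherited:
        return {"strategy": "species_subclass", "uniprot": inherited}

    # Step 3: left for "some other strategy" (e.g. lexical mapping).
    return {"strategy": "unmapped", "uniprot": None}


# Example data. The PR identifiers below are placeholders, not real PRO IDs.
EXAMPLE_XREFS = {
    "PR:generic_CD22": ["PIRSF:placeholder"],  # B-cell receptor CD22
    "PR:mouse_CD22": ["UniProtKB:P35329"],     # B-cell receptor CD22 (mouse)
}
EXAMPLE_SUBCLASSES = {"PR:generic_CD22": {"mouse": "PR:mouse_CD22"}}

if __name__ == "__main__":
    print(map_marker("PR:generic_CD22", EXAMPLE_XREFS, EXAMPLE_SUBCLASSES))
```

Returning the strategy alongside the IDs makes it easy to count how many markers each step covers and how many are left for the remaining strategy.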
TBD: Accessible representation in CL.
CC @AvolaAmg
Yes, I believe most of the PR terms used in CL are category=gene and follow a stereotypical text definition marking them as the product of the human gene or its ortholog.
E.g. https://www.ebi.ac.uk/ols4/ontologies/pr/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPR_000001408?lang=en
Ideally PRO would have logical definitions for these, which would make tracing back easier. It should be easy to do this via string matching, but ideally this would be done upstream in PRO.
Another idea would be for PRO to release SSSOM with inferred downward mappings for all category=gene terms.
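To make the SSSOM idea concrete, here is a rough sketch of what such a mapping table might contain. The four column names are standard SSSOM slots; the choice of `skos:narrowMatch` and `semapv:LogicalReasoning` is my assumption about how "inferred downward" mappings would be expressed, not something PRO has committed to. The subject ID in the example is a placeholder, and a real SSSOM file would also need a metadata header, which is omitted here.

```python
import csv
import io

SSSOM_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def to_sssom_rows(generic_to_uniprot):
    """Turn {generic PRO term: [UniProtKB CURIE, ...]} into SSSOM-style rows."""
    rows = []
    for subject_id in sorted(generic_to_uniprot):
        for object_id in sorted(generic_to_uniprot[subject_id]):
            rows.append({
                "subject_id": subject_id,
                # Assumption: the species-specific protein is narrower than the gene-level term.
                "predicate_id": "skos:narrowMatch",
                "object_id": object_id,
                # Assumption: mappings are inferred from the class hierarchy.
                "mapping_justification": "semapv:LogicalReasoning",
            })
    return rows


def write_sssom_tsv(rows):
    """Serialise rows as tab-separated text (metadata header not included)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SSSOM_COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # 'PR:generic_CD22' is a placeholder ID; the UniProtKB xref is from the table earlier in the thread.
    print(write_sssom_tsv(to_sssom_rows({"PR:generic_CD22": ["UniProtKB:P35329"]})))
```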
The file containing PIRSF membership can be found at https://proteininformationresource.org/projects/pirsf/. Note that the identifiers in this file don't contain 'PIR' (so, 'SF001234' instead of 'PIRSF001234'). This file goes beyond human and mouse, if that's what you need. If you only want human and mouse, then you can use our 'descendants' API for PRO:
https://lod.proconsortium.org/api.html#/DAG/getDescendantByProIDs
which is part of a larger set of APIs given here:
https://lod.proconsortium.org/api.html
You'll want to focus on the terms with local IDs that have UniProtKB accessions without a dash.
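The two ID details mentioned above (the missing 'PIR' in the membership file, and picking out accessions without a dash) can be handled with a couple of small helpers. This is a sketch under assumptions: I am reading "local ID" as the part of a PRO ID after `PR:`, the function names are made up, and `PR:P35329-1` is only an illustrative isoform-style ID. The regular expression is the accession pattern documented by UniProt.

```python
import re

# Accession pattern as documented by UniProt.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def pirsf_curie(file_id):
    """Convert an ID from the PIRSF membership file ('SF001234') to the
    xref form used in PRO ('PIRSF:PIRSF001234')."""
    if not file_id.startswith("PIR"):
        file_id = "PIR" + file_id
    return "PIRSF:" + file_id


def accession_from_pro_id(pro_id):
    """Return the UniProtKB accession if the PRO local ID is a plain
    accession (no dash, so not an isoform-style ID); otherwise None."""
    prefix, _, local_id = pro_id.partition(":")
    if prefix != "PR":
        return None
    # fullmatch rejects dashed IDs as well as numeric PRO IDs.
    if UNIPROT_ACCESSION.fullmatch(local_id):
        return local_id
    return None


if __name__ == "__main__":
    print(pirsf_curie("SF001234"))
    for pro_id in ("PR:P35329", "PR:P35329-1", "PR:000001002"):
        print(pro_id, accession_from_pro_id(pro_id))
```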
This issue has not seen any activity in the past 6 months; it will be closed automatically in one year from now if no action is taken.