Protein Profile for sites that mix species data

Open AlasdairGray opened this issue 5 years ago • 14 comments

Protein profile assumes a single protein for a single species. Sites such as Guide to Pharmacology focus more on the interaction rather than the protein. Thus their page on A1 Receptor includes data about 3 species. While the data can be separated into the different species, there would only be one page identifier.

How should we properly model this with the Protein Profile?

AlasdairGray avatar Aug 16 '19 13:08 AlasdairGray

A first approach could be just to model a single species. Another approach could be to create sub-identifiers of the form https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=18#homosapien

AlasdairGray avatar Aug 16 '19 13:08 AlasdairGray
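
The sub-identifier idea above could be sketched as follows. This is only an illustration: the helper name `species_fragment_id` and the mouse and rat fragment labels are hypothetical, as the thread only shows `#homosapien`.

```python
PAGE_URL = "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=18"

# Hypothetical fragment labels; only "homosapien" appears in the thread.
SPECIES_LABELS = ["homosapien", "musmusculus", "rattusnorvegicus"]


def species_fragment_id(page_url: str, label: str) -> str:
    """Return page_url with any existing fragment replaced by the species label."""
    base = page_url.split("#", 1)[0]  # drop an existing "#..." part, if any
    return f"{base}#{label}"


# One identifier per species, all resolving to the same page.
sub_ids = [species_fragment_id(PAGE_URL, label) for label in SPECIES_LABELS]
```

Because fragments are resolved client-side, all three identifiers still dereference to the single page, which matches the "one page identifier" constraint in the original question.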

Another option is just to model the protein and not define a species.

AlasdairGray avatar Nov 28 '19 13:11 AlasdairGray

taxonomicRange is recommended and takes MANY. Why is that MANY not enough to cover this case? Do you have an example?

ljgarcia avatar Nov 28 '19 15:11 ljgarcia
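
For reference, a single Protein carrying several taxonomicRange values might look like the sketch below. It reuses the page URL and taxon identifiers that appear elsewhere in this thread; it is an assumption about how the MANY cardinality would be used here, not markup taken from GtP.

```python
import json

# A single species-less Protein with taxonomicRange given as a list (MANY).
protein = {
    "@context": "http://schema.org",
    "@id": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=18",
    "@type": "Protein",
    "name": "A1 receptor",
    "taxonomicRange": [
        {"@id": "https://identifiers.org/taxonomy:9606", "@type": "Taxon", "name": "Human"},
        {"@id": "https://identifiers.org/taxonomy:10090", "@type": "Taxon", "name": "Mouse"},
        {"@id": "https://identifiers.org/taxonomy:10116", "@type": "Taxon", "name": "Rat"},
    ],
}

# Serialise to the JSON-LD text that would go in the page.
markup = json.dumps(protein, indent=2)
```

This shape lists the species but gives no place to say which gene or UniProt accession belongs to which species, which is the gap the later comments try to close.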

@simondharding I have created a first version of the GtP markup for a Protein. This should conform to the 0.9-DRAFT of the Protein Profile. There are more properties in the profile that can be used to mark up the other data that you have on the page. Feel free to extend the example with these properties. You probably need to do it in the species-specific sub-parts. For some of the other receptor families, e.g. Enzymes, we can use other profiles.

@ljgarcia it would be good to get your opinion on the modelling. I have created a species-less main entity and then used the hasBioChemEntityPart property to add in the three species that they have in the database. Also, what is the property to use to link a protein to its protein family, and what type should the protein family page have? Is it just a protein?

AlasdairGray avatar Nov 28 '19 16:11 AlasdairGray

@AlasdairGray thanks, this looks promising. We should include the UniProt ID for the protein (for each species) - which property should I use to do this? sameAs ? https://www.uniprot.org/uniprot/P30542

simondharding avatar Dec 05 '19 15:12 simondharding

@AlasdairGray Is this "part" a BioChemEntity?

{
  "isEncodedByBioChemEntity": {
    "@type": "Gene",
    "name": "adenosine A1 receptor",
    "identifier": "ADORA1",
    "hasRepresentation": "1q32.1"
  },
  "taxonomicRange": {
    "@id": "https://identifiers.org/taxonomy:9606",
    "@type": "Taxon",
    "name": "Human"
  }
},

Given the range of hasBioChemEntityPart, I am guessing yes. If so, why is the type not included? I am also guessing these parts are Protein; again, the type should be used. One disadvantage here is the lack of "@id" for the protein parts, so there is no link to any actual entity.

@simondharding If those parts do correspond to UniProt entries, you could use the UniProt identifiers directly in your markup. This would solve the missing "@id" problem for the protein parts:

"hasBioChemEntityPart": [
  { "@id": "http://purl.uniprot.org/uniprot/P30542" }
]

ljgarcia avatar Dec 06 '19 08:12 ljgarcia

@AlasdairGray @ljgarcia so something like this:

{
  "@context": "http://schema.org",
  "@type": "DataRecord",
  "@id": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=18#",
  "includedInDataset": "https://www.guidetopharmacology.org/index.jsp#dataset",
  "citation": {
    "@id": "https://doi.org/10.2218/gtopdb/F3/2019.4",
    "@type": "ScholarlyPublication"
  },
  "mainEntity": {
    "@id": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=18",
    "@type": "Protein",
    "http://purl.org/dc/terms/conformsTo": "https://bioschemas.org/specifications/Protein/0.9-DRAFT",
    "identifier": "18",
    "name": "A1 receptor",
    "description": "class A G protein-coupled receptor",
    "alternateName": ["RDC7", "adenosine receptor A1", "A1-AR", "A1R"],
    "url": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=18",
    "hasBioChemEntityPart": [
      { "@id": "http://purl.uniprot.org/uniprot/P30542" },
      { "@id": "http://purl.uniprot.org/uniprot/Q60612" },
      { "@id": "http://purl.uniprot.org/uniprot/P25099" }
    ]
  }
}

simondharding avatar Dec 06 '19 09:12 simondharding

Good point @ljgarcia about the lack of type and identifier for the subparts. The two options would be:

  1. Directly using UniProt
...
"hasBioChemEntityPart": [
  { 
    "@id": "http://purl.uniprot.org/uniprot/P30542",  
    "@type": "Protein"
  },
  { 
    "@id": "http://purl.uniprot.org/uniprot/Q60612",  
    "@type": "Protein"
  },
  { 
    "@id": "http://purl.uniprot.org/uniprot/P25099",  
    "@type": "Protein"
  }
],
...
  2. Using sameAs link
...
"hasBioChemEntityPart": [
      {
        "@type": "Protein",
        "sameAs": "http://purl.uniprot.org/uniprot/P30542",
        "isEncodedByBioChemEntity": {
          "@type": "Gene",
          "name": "adenosine A1 receptor",
          "identifier": "ADORA1",
          "hasRepresentation": "1q32.1"
        },
        "taxonomicRange": {
          "@id": "https://identifiers.org/taxonomy:9606",
          "@type": "Taxon",
          "name": "Human"
        }
      },
      {
        "@type": "Protein",
        "sameAs": "http://purl.uniprot.org/uniprot/Q60612",
        "isEncodedByBioChemEntity": {
          "@type": "Gene",
          "name": "adenosine A1 receptor",
          "identifier": "Adora1",
          "hasRepresentation": "1 E4"
        },
        "taxonomicRange": {
          "@id": "https://identifiers.org/taxonomy:10090",
          "@type": "Taxon",
          "name": "Mouse"
        }
      },
      {
        "@type": "Protein",
        "sameAs": "http://purl.uniprot.org/uniprot/P25099",
        "isEncodedByBioChemEntity": {
          "@type": "Gene",
          "name": "adenosine A1 receptor",
          "identifier": "Adora1",
          "hasRepresentation": "13q13"
        },
        "taxonomicRange": {
          "@id": "https://identifiers.org/taxonomy:10116",
          "@type": "Taxon",
          "name": "Rat"
        }
      }
    ]
    ...

At this point, UniProt does not have Bioschemas markup, so the second approach means that there will be data available for the construction of the knowledge graph. The first approach gives a more direct link to UniProt, but means that GtP are not making any assertions about the data.

AlasdairGray avatar Dec 06 '19 10:12 AlasdairGray
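
The difference between the two options can be made concrete with a small check. This is a minimal sketch; `is_bare_reference` is a hypothetical helper, and the two nodes are trimmed copies of the options shown above.

```python
def is_bare_reference(node: dict) -> bool:
    """True when a JSON-LD node carries only an @id and, optionally, an @type."""
    return "@id" in node and set(node) <= {"@id", "@type"}


# Option 1: point straight at UniProt and assert nothing else locally.
option_1 = {"@id": "http://purl.uniprot.org/uniprot/P30542", "@type": "Protein"}

# Option 2: a local node that links out with sameAs and carries its own data.
option_2 = {
    "@type": "Protein",
    "sameAs": "http://purl.uniprot.org/uniprot/P30542",
    "taxonomicRange": {
        "@id": "https://identifiers.org/taxonomy:9606",
        "@type": "Taxon",
        "name": "Human",
    },
}
```

A consumer building a knowledge graph from the page alone gets no species or gene information from a bare reference, whereas the sameAs form supplies it at the cost of duplicating what UniProt may later publish itself.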

@AlasdairGray do HGNC have Bioschemas markup? I wonder if the Gene @type should include the HGNC ID, and likewise the MGI and RGD IDs for mouse and rat, rather than using the gene symbol as the identifier.

simondharding avatar Dec 06 '19 14:12 simondharding

As @AlasdairGray suggests, having the sameAs would link to UniProt and would also provide the data. Once UniProt supports Bioschemas markup, the duplicated data could be removed.

@simondharding What is done for UniProt proteins can also be done for genes. I do not see HGNC at https://bioschemas.org/liveDeploys/, so an approach similar to the one suggested for UniProt would be the way to go for now. The identifier could still be the gene symbol, if you actually use it as your identifier or if HGNC uses it as its identifier (as this seems to be your reference database for genes). An Ensembl ID could also be a possibility for gene IDs.

ljgarcia avatar Dec 09 '19 16:12 ljgarcia

Hi @AlasdairGray @ljgarcia I've got the following prepared for the target page on GtoPdb. I've included the proteins and genes all under "hasBioChemEntityPart". Ideally, I'd use isEncodedByBioChemEntity as a subclause for each protein, but there are cases where more than one gene and protein per species are included on our target pages. Happy to discuss, but it would be useful to know how this looks.

<!-- BioSchemas Mark-Up For Targets -->
        <script type="application/ld+json">  
            {
                "@context": "http://schema.org", 
                "@id": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=19#",
                "@type": "DataRecord",
                "includedInDataset": {
                    "@type": "Dataset",
                    "@id": "https://www.guidetopharmacology.org/index.jsp#dataset"
                },
                "citation": {
                    "@id": "",
                    "@type": "ScholarlyPublication"
                },
                "mainEntity": {
                "@id": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=19",
                "@type": "Protein",
                "http://purl.org/dc/terms/conformsTo": "https://bioschemas.org/specifications/Protein/0.9-DRAFT",
                "identifier": "19",
                "name": "A<sub>2A</sub> receptor",
                "description": "A<sub>2A</sub> receptor",
                "alternateName": ["RDC8","A2-AR","adenosine receptor A2a"],
                "url": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=19",

                "hasBioChemEntityPart": [
                {
                        "@type": "Protein",
                        "sameAs": "https://www.uniprot.org/uniprot/P29274",
                        "taxonomicRange": {
                            "@id": "https://identifiers.org/taxonomy:9606",
                            "@type": "Taxon",
                            "name": "Human"
                        }
                        },
                {
                        "@type": "Protein",
                        "sameAs": "https://www.uniprot.org/uniprot/Q60613",
                        "taxonomicRange": {
                            "@id": "https://identifiers.org/taxonomy:10090",
                            "@type": "Taxon",
                            "name": "Mouse"
                        }
                        },
                {
                        "@type": "Protein",
                        "sameAs": "https://www.uniprot.org/uniprot/P30543",
                        "taxonomicRange": {
                            "@id": "https://identifiers.org/taxonomy:10116",
                            "@type": "Taxon",
                            "name": "Rat"
                        }
                        },
                {
                        "@type": "Gene",
                        "sameAs": "https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=2049",
                        "name": "Adora2a",
                        "identifier": "2049",
                        "hasRepresentation": "20p12",
                        "taxonomicRange": {
                            "@id": "https://identifiers.org/taxonomy:10116",
                            "@type": "Taxon",
                            "name": "Rat"
                        }
                        },
                {
                        "@type": "Gene",
                        "sameAs": "https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:263",
                        "name": "ADORA2A",
                        "identifier": "263",
                        "hasRepresentation": "22q11.23",
                        "taxonomicRange": {
                            "@id": "https://identifiers.org/taxonomy:9606",
                            "@type": "Taxon",
                            "name": "Human"
                        }
                        },
                {
                        "@type": "Gene",
                        "sameAs": "http://www.informatics.jax.org/marker/MGI:99402",
                        "name": "Adora2a",
                        "identifier": "MGI:99402",
                        "hasRepresentation": "10",
                        "taxonomicRange": {
                            "@id": "https://identifiers.org/taxonomy:10090",
                            "@type": "Taxon",
                            "name": "Mouse"
                        }
                        }
                ]        
                }
            }
        </script>
<!-- END OF BioSchemas Mark-Up -->

simondharding avatar Dec 20 '19 10:12 simondharding
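
Since the markup is embedded in a script block, a quick parse check can catch structural slips such as a property placed outside the object it belongs to. The sketch below uses only the Python standard library; `parse_jsonld_blocks` is a hypothetical helper, and the regular expression assumes the script tag is written exactly as in the snippet above.

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks.
SCRIPT_RE = re.compile(
    r'<script\s+type="application/ld\+json"\s*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def parse_jsonld_blocks(html: str) -> list:
    """Parse every embedded JSON-LD block; raises ValueError on malformed JSON."""
    return [json.loads(body) for body in SCRIPT_RE.findall(html)]


# Usage: a well-formed block parses into ordinary Python data.
page = '<script type="application/ld+json">{"@type": "Protein"}</script>'
blocks = parse_jsonld_blocks(page)
```

This only checks that the JSON is well formed; it says nothing about whether the properties conform to the Protein profile.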

Hi @simondharding

It looks good although having both proteins and genes as targets for hasBioChemEntityPart seems odd (to me).

If you add isEncodedByBioChemEntity to the UniProt proteins, and that points to genes, will you still need the genes as targets of hasBioChemEntityPart?

Also, not sure what you mean by "But there are cases where more than one gene and protein per species are included on our target pages". If that is adding more of your proteins to the mainEntity, then using a list would solve it. If that is adding more proteins/genes to the hasBioChemEntityPart, I am not sure why this would be an issue.

Cheers,

ljgarcia avatar Jan 08 '20 17:01 ljgarcia
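
One way to read the suggestion above is to nest the genes under the proteins and use a list wherever a protein has more than one gene. The sketch below regroups a flat hasBioChemEntityPart list that way. Pairing genes with proteins by shared taxon is an assumption made purely for illustration, and `nest_genes_under_proteins` is a hypothetical helper.

```python
from copy import deepcopy


def nest_genes_under_proteins(parts: list) -> list:
    """Attach each Gene to the Protein(s) that share its taxonomicRange @id.

    Pairing by taxon is an assumption of this sketch. A page with several
    proteins for one species would need an explicit gene-to-protein mapping.
    """
    genes_by_taxon = {}
    for node in parts:
        if node.get("@type") == "Gene":
            taxon_id = node.get("taxonomicRange", {}).get("@id")
            if taxon_id is not None:
                genes_by_taxon.setdefault(taxon_id, []).append(deepcopy(node))

    proteins = []
    for node in parts:
        if node.get("@type") == "Protein":
            protein = deepcopy(node)
            taxon_id = protein.get("taxonomicRange", {}).get("@id")
            genes = genes_by_taxon.get(taxon_id, []) if taxon_id else []
            if genes:
                # A list, so more than one gene per protein is representable.
                protein["isEncodedByBioChemEntity"] = genes
            proteins.append(protein)
    return proteins


# Usage with a trimmed version of the human entries from the markup above.
parts = [
    {"@type": "Protein", "sameAs": "https://www.uniprot.org/uniprot/P29274",
     "taxonomicRange": {"@id": "https://identifiers.org/taxonomy:9606"}},
    {"@type": "Gene", "name": "ADORA2A",
     "taxonomicRange": {"@id": "https://identifiers.org/taxonomy:9606"}},
]
nested = nest_genes_under_proteins(parts)
```

With this shape, hasBioChemEntityPart holds only proteins, and each gene is reachable through the protein it encodes.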

So if I have a specific human gene and I want to link the identifiers for all the homologous and orthologous genes, I would model it using 'hasBioChemEntityPart'. Then if I want to link the gene to exons of the human gene, it would also be modeled using 'hasBioChemEntityPart'. Similarly, 'hasBioChemEntityPart' would be used to link both homologous proteins and protein subdomains to a protein. Did I understand this correctly? It feels a little confusing to me to mix actual biochemical parts (exons, subdomains) with homologs (the complete gene/protein in other species).

gtsueng avatar Aug 10 '22 14:08 gtsueng

sdo:sameAs is not appropriate for all use cases, although some may choose to use it. Looking at the properties in BioChemEntity, the best we currently have would be bioChemSimilarity.

It may be that we want to think about proposing a new property for this case.

AlasdairGray avatar Aug 18 '22 15:08 AlasdairGray
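
As a rough illustration of the bioChemSimilarity option, the human protein could point at its mouse and rat counterparts as shown below. This is only a sketch of the shape: it reuses the UniProt identifiers from earlier in the thread, and whether the property's semantics really cover homologs is the open question raised above.

```python
import json

# Human A1 receptor linked to the mouse and rat entries via bioChemSimilarity
# instead of hasBioChemEntityPart, keeping "parts" for real sub-structures.
human_a1 = {
    "@context": "http://schema.org",
    "@id": "http://purl.uniprot.org/uniprot/P30542",
    "@type": "Protein",
    "name": "A1 receptor",
    "taxonomicRange": {
        "@id": "https://identifiers.org/taxonomy:9606",
        "@type": "Taxon",
        "name": "Human",
    },
    "bioChemSimilarity": [
        {"@id": "http://purl.uniprot.org/uniprot/Q60612", "@type": "Protein"},
        {"@id": "http://purl.uniprot.org/uniprot/P25099", "@type": "Protein"},
    ],
}

markup = json.dumps(human_a1, indent=2)
```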