import orthology data from AGR

Open andrewsu opened this issue 2 years ago • 10 comments

The Alliance of Genome Resources (AGR) is a consortium of the databases for the most widely used model organisms (mouse, rat, worm, fly, zebrafish, yeast), and it has released its own set of ortholog assignments at https://www.alliancegenome.org/downloads#orthology. We currently have HomoloGene in mygene.info, but I think the AGR set would be considered much more reliable for the organisms listed.

andrewsu avatar Aug 12 '21 16:08 andrewsu

The parser and manifest repo is here. A few test queries can be found in this notebook.

The repo was passed along for final deployment into the mygene API; I will update the status when it is available.

NikkiBytes avatar Oct 12 '21 16:10 NikkiBytes

Commits imported, see https://github.com/biothings/mygene.info/commit/360b97162dcb16eb4f4eb4e1e73e278b9045dcf4 and the commits prior to this one. Subject of each commit has been prepended with the text "orthologyAGR:". Will close the issue after successfully running and importing the data into the builds.

zcqian avatar Oct 13 '21 20:10 zcqian

Hi @NikkiBytes, can you add metadata to the plugin so that it correctly reports the origin and license of the data?

You can see how it's done here and here.

zcqian avatar Oct 19 '21 22:10 zcqian

@zcqian it's added!

NikkiBytes avatar Oct 20 '21 15:10 NikkiBytes

@NikkiBytes I see that you modified setup_release in parser.py as well. Is this change also intended?

Also, next time you can fork this repository and open a pull request; it's easier to merge the changes that way. This time I can just incorporate the changes from your repository into this one.

zcqian avatar Oct 20 '21 17:10 zcqian

A few updates:

The document structure has been updated; example output below:

[
    {
        "_id": "176377",
        "agr": {
            "ortholog": [
                {
                    "geneid": "SGD:S000003566",
                    "symbol": "VPS53",
                    "taxid": "NCBITaxon:559292",
                    "algorithmsmatch": 9,
                    "outofalgorithms": 10,
                    "isbestscore": true,
                    "isbestrevscore": true
                }
            ]
        }
    },
    {
        "_id": "ZDB-GENE-041114-199",
        "agr": {
            "ortholog": [
                {
                    "geneid": "SGD:S000003566",
                    "symbol": "VPS53",
                    "taxid": "NCBITaxon:559292",
                    "algorithmsmatch": 8,
                    "outofalgorithms": 10,
                    "isbestscore": true,
                    "isbestrevscore": true
                }
            ]
        }
    },
    {
        "_id": "1311391",
        "agr": {
            "ortholog": [
                {
                    "geneid": "SGD:S000003566",
                    "symbol": "VPS53",
                    "taxid": "NCBITaxon:559292",
                    "algorithmsmatch": 7,
                    "outofalgorithms": 9,
                    "isbestscore": true,
                    "isbestrevscore": true
                }
            ]
        }
    }
.
.
.
]
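As a rough illustration of how a consumer might read this structure, the sketch below walks the nested `agr.ortholog` list and reports what fraction of the algorithms agreed on each ortholog. The field names are taken from the example output above; the helper `ortholog_support` is hypothetical and not part of the plugin.

```python
# Two documents copied from the example output above.
docs = [
    {
        "_id": "176377",
        "agr": {"ortholog": [{
            "geneid": "SGD:S000003566", "symbol": "VPS53",
            "taxid": "NCBITaxon:559292",
            "algorithmsmatch": 9, "outofalgorithms": 10,
            "isbestscore": True, "isbestrevscore": True,
        }]},
    },
    {
        "_id": "ZDB-GENE-041114-199",
        "agr": {"ortholog": [{
            "geneid": "SGD:S000003566", "symbol": "VPS53",
            "taxid": "NCBITaxon:559292",
            "algorithmsmatch": 8, "outofalgorithms": 10,
            "isbestscore": True, "isbestrevscore": True,
        }]},
    },
]


def ortholog_support(doc):
    """Yield (geneid, fraction of algorithms agreeing) for each ortholog.

    Hypothetical helper: documents without an "agr" block yield nothing.
    """
    for orth in doc.get("agr", {}).get("ortholog", []):
        yield orth["geneid"], orth["algorithmsmatch"] / orth["outofalgorithms"]


for doc in docs:
    for geneid, fraction in ortholog_support(doc):
        print(doc["_id"], geneid, fraction)
```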

Notes:

  • The variable "algorithms": "PhylomeDB|OrthoFinder|Hieranoid|OMA|Ensembl Compara|Roundup|InParanoid|PANTHER|OrthoInspector" is available in the data file. The output document doesn't currently include it (as shown above), but it can easily be added. [Note: if it is added, @newgene and I discussed possibly reformatting the data string into a list.] I also noted this in the README.md file in the repository.
    i.e. "PhylomeDB|OrthoFinder|Hieranoid|OMA|Ensembl Compara|Roundup|InParanoid|PANTHER|OrthoInspector" ---> ["PhylomeDB", "OrthoFinder", "Hieranoid", "OMA", "Ensembl Compara", "Roundup", "InParanoid", "PANTHER", "OrthoInspector"]

  • Some gene _id values return None when queried with biothings_client. For example, the second document above, "_id": "ZDB-GENE-041114-199", returned None with these commands:

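from biothings_client import get_client

gene_client = get_client('gene')

# gene_id is the document _id to look up, here the ID that returned None
gene_id = 'ZDB-GENE-041114-199'
gene = gene_client.getgene(gene_id, fields='symbol,name')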

Is there another query method I can use, or will there be unique IDs without matches? @newgene @andrewsu
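One possible approach, sketched here under assumptions, is to try the direct `getgene` lookup first and fall back to `querymany` with explicit scopes when it returns None. `getgene` and `querymany` are existing biothings_client methods; the helper `lookup_gene`, the `StubClient` class (used only so the sketch runs offline), and the scope names `"entrezgene,ZFIN"` are assumptions that would need checking against the fields mygene.info actually indexes.

```python
def lookup_gene(client, gene_id, scopes="entrezgene,ZFIN", fields="symbol,name"):
    """Hypothetical helper: direct _id lookup, then a scoped query as fallback."""
    hit = client.getgene(gene_id, fields=fields)
    if hit is not None:
        return hit
    # querymany marks misses with {"notfound": True}; keep only real matches.
    results = client.querymany([gene_id], scopes=scopes, fields=fields)
    matches = [r for r in results if not r.get("notfound")]
    return matches[0] if matches else None


class StubClient:
    """Offline stand-in for get_client('gene'); the data below is made up."""

    def getgene(self, gene_id, fields=None):
        # Pretend only Entrez-style numeric IDs resolve as _id.
        return {"_id": gene_id, "symbol": "STUB"} if gene_id.isdigit() else None

    def querymany(self, qterms, scopes=None, fields=None):
        out = []
        for term in qterms:
            if term.startswith("ZDB-GENE-"):
                out.append({"query": term, "_id": "0", "symbol": "STUB"})
            else:
                out.append({"query": term, "notfound": True})
        return out


client = StubClient()
print(lookup_gene(client, "176377"))
print(lookup_gene(client, "ZDB-GENE-041114-199"))
print(lookup_gene(client, "no-such-id"))
```

With the real client, `StubClient()` would be replaced by `get_client('gene')`. IDs with no match in any scope would still come back as None, so the parser would need to decide how to handle them.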

NikkiBytes avatar Oct 26 '21 20:10 NikkiBytes

Can you add a mapping document to the plugin? Thanks!
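For reference, a mapping for the fields in the example output might look roughly like the sketch below. It is written as an Elasticsearch-style mapping dictionary; the function name `get_agr_mapping` and the choice of types are assumptions, and how the function is wired into the plugin is not shown here.

```python
def get_agr_mapping():
    """Hypothetical mapping for the "agr" field; types are assumptions
    inferred from the example output, not the plugin's actual mapping."""
    return {
        "agr": {
            "properties": {
                "ortholog": {
                    "properties": {
                        "geneid": {"type": "keyword"},
                        "symbol": {"type": "keyword"},
                        # Stored as a string such as "NCBITaxon:559292" in the
                        # example output; would be "integer" if converted.
                        "taxid": {"type": "keyword"},
                        "algorithmsmatch": {"type": "integer"},
                        "outofalgorithms": {"type": "integer"},
                        "isbestscore": {"type": "boolean"},
                        "isbestrevscore": {"type": "boolean"},
                    }
                }
            }
        }
    }


ortholog_fields = get_agr_mapping()["agr"]["properties"]["ortholog"]["properties"]
print(sorted(ortholog_fields))
```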

zcqian avatar Oct 26 '21 20:10 zcqian

@NikkiBytes let's make the taxid field an integer.
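A minimal sketch of that conversion, assuming taxid values arrive in the "NCBITaxon:559292" form shown in the example output. The helper `normalize_ortholog` is hypothetical; it also splits the pipe-delimited "algorithms" string into a list, as discussed earlier in the thread, in case that field is added.

```python
def normalize_ortholog(record):
    """Return a copy of an ortholog record with taxid as an int and,
    if present, the "algorithms" string split into a list."""
    out = dict(record)
    taxid = out.get("taxid")
    if isinstance(taxid, str):
        # "NCBITaxon:559292" -> 559292; a bare "559292" also works.
        out["taxid"] = int(taxid.rsplit(":", 1)[-1])
    algorithms = out.get("algorithms")
    if isinstance(algorithms, str):
        # Split on "|" only, so "Ensembl Compara" stays one entry.
        out["algorithms"] = algorithms.split("|")
    return out


example = {
    "geneid": "SGD:S000003566",
    "taxid": "NCBITaxon:559292",
    "algorithms": "OMA|Ensembl Compara|PANTHER",
}
print(normalize_ortholog(example))
```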

newgene avatar Oct 26 '21 20:10 newgene

@NikkiBytes I assume you're done making the changes at https://github.com/NikkiBytes/orthologyAGR? If so, I will merge the changes.

zcqian avatar Dec 21 '21 00:12 zcqian

Added @jal347 to work with @NikkiBytes on the deployment to the mygene.info API.

newgene avatar Mar 22 '24 17:03 newgene