Add HGNC family info to MyGene.info
HGNC contains gene group info: https://www.genenames.org/data/genegroup/#!/group/567
This file has the gene-to-family links: http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/csv/genefamily_db_tables/gene_has_family.csv
hgnc_id | family_id |
---|---|
11148 | 3 |
3960 | 3 |
3961 | 3 |
3477 | 1963 |
4621 | 1963 |
4622 | 1963 |
9962 | 1963 |
16719 | 1963 |
This file has the name and metadata for each HGNC family: http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/csv/genefamily_db_tables/family.csv
id | abbreviation | name | external_note | pubmed_ids | desc_comment | desc_label | desc_source | desc_go | typical_gene |
---|---|---|---|---|---|---|---|---|---|
1296 | | TIR domain containing | NULL | NULL | NULL | NULL | NULL | | TIRAP |
75 | ZDBF | Zinc fingers DBF-type | NULL | NULL | NULL | NULL | NULL | | ZDBF2 |
302 | CLCN | Chloride voltage-gated channels | NULL | NULL | NULL | NULL | NULL | | CLCN1 |
228 | HCRTR | Hypocretin receptors | NULL | NULL | NULL | NULL | NULL | | HCRTR1 |
The combination of these two files should be what we initially add to mygene.info records for each human gene.
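As a rough illustration of combining the two files, here is a minimal sketch that joins the gene-to-family links with the family metadata and groups the result per `hgnc_id`. It is not the plugin's actual parser: the inline CSV text is a trimmed stand-in for the real files (only four of the ten `family.csv` columns, comma-separated, which is an assumption about the file layout), and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Trimmed stand-ins for the two HGNC files; rows are taken from this thread.
GENE_HAS_FAMILY = """hgnc_id,family_id
11148,3
3960,3
3961,3
3477,1963
"""

FAMILY = """id,abbreviation,name,typical_gene
3,FSCN,Fascin family,FSCN1
75,ZDBF,Zinc fingers DBF-type,ZDBF2
"""


def clean(value):
    """Treat empty cells and the literal string NULL as missing."""
    return None if value in ("", "NULL") else value


def load_families(handle):
    """Index family rows by id, dropping missing values."""
    families = {}
    for row in csv.DictReader(handle):
        record = {key: clean(val) for key, val in row.items()}
        families[row["id"]] = {k: v for k, v in record.items() if v is not None}
    return families


def join_gene_families(links_handle, families):
    """Group family records under each hgnc_id; skip links with no family row."""
    by_gene = defaultdict(list)
    for row in csv.DictReader(links_handle):
        family = families.get(row["family_id"])
        if family is not None:
            by_gene[row["hgnc_id"]].append(family)
    return dict(by_gene)


families = load_families(io.StringIO(FAMILY))
gene_groups = join_gene_families(io.StringIO(GENE_HAS_FAMILY), families)
```

In this toy input, family 1963 has no metadata row, so the link for `hgnc_id` 3477 is skipped. Whether the real pipeline should skip or keep such links is a design choice this sketch does not settle.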
It looks like there's already some gene group info in MyGene.info that is shown as a node attribute in BTE.
BTE brings in InterPro info using this code.
Querying that field in MyGene shows that it includes some gene family info, and possibly also some info that's specific to individual domains of the protein: https://mygene.info/v3/query?q=CDK2&fields=interpro.desc%2C%20type_of_gene
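For anyone who wants to repeat that lookup for other genes, the query URL above can be rebuilt with the standard library. This is only a sketch: `build_query_url` is a hypothetical helper, and it only constructs the URL without sending a request.

```python
from urllib.parse import quote, urlencode

MYGENE_QUERY_URL = "https://mygene.info/v3/query"


def build_query_url(symbol, fields):
    """Build a MyGene.info v3 query URL for one gene symbol and a list of fields."""
    params = {"q": symbol, "fields": ", ".join(fields)}
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    return f"{MYGENE_QUERY_URL}?{urlencode(params, quote_via=quote)}"


url = build_query_url("CDK2", ["interpro.desc", "type_of_gene"])
```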
I made the plugin for hgnc_family. The main branch contains the manifest and parser; the v2 branch contains the advanced plugin. If we use the advanced plugin, can someone check whether I did the mapping correctly? Thanks. https://github.com/jal347/hgnc_family
This is a quick summary of the current HGNC mapping. The total number of hgnc_id data points (gene-to-family links) is 29872, covering 24952 unique hgnc_ids. Of those 24952 hgnc_ids, 24895 were mapped, while 57 could not be found by querying mygene.info. 21100 hgnc_ids map to exactly one family_id (1-1) and 3852 map to more than one (1-n). The largest number of family_ids for a single hgnc_id is 7. An example is shown below, along with more detailed information on the 1-n mappings.
{
  "_id": "6624",
  "hgnc_genegroup": [
    {
      "id": "3",
      "abbr": "FSCN",
      "name": "Fascin family",
      "comments": "",
      "pubmed": [
        21618240
      ],
      "typical_gene": "FSCN1"
    }
  ]
}
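The counts in the summary above (total links, unique hgnc_ids, 1-1 versus 1-n, and the maximum number of families per gene) could be reproduced along these lines. This is a sketch rather than the code that produced those numbers: `summarize_links` is a hypothetical helper, and the three input pairs are toy placeholders, not real HGNC data.

```python
from collections import defaultdict


def summarize_links(links):
    """Summarize (hgnc_id, family_id) pairs into link and fan-out counts."""
    families_by_gene = defaultdict(set)
    total_links = 0
    for hgnc_id, family_id in links:
        total_links += 1
        families_by_gene[hgnc_id].add(family_id)

    fanout = [len(families) for families in families_by_gene.values()]
    return {
        "total_links": total_links,
        "unique_hgnc_ids": len(families_by_gene),
        "one_to_one": sum(1 for n in fanout if n == 1),
        "one_to_many": sum(1 for n in fanout if n > 1),
        "max_families_per_gene": max(fanout, default=0),
    }


# Toy placeholder pairs: gene "A" belongs to two families, gene "B" to one.
summary = summarize_links([("A", "1"), ("A", "2"), ("B", "1")])
```

Run over the full `gene_has_family.csv`, `one_to_one + one_to_many` should equal `unique_hgnc_ids`, which matches the figures reported above (21100 + 3852 = 24952).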
(I remember commenting on this yesterday; where did it go ...)
added here: https://github.com/biothings/mygene.info/tree/add_hgnc_family/src/plugins/hgnc_family
@newgene should PubMed ID be of type long and not indexed? This is what we have in other sources in MyGene.
@zcqian (RE: pubmed) good catch. Let's keep this field the same as other sources then.
@newgene should we index the PubMed ID field?
Not for now; we can change it later if we do need to query it.
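To make the outcome of this exchange concrete, an Elasticsearch mapping for the `hgnc_genegroup` field with `pubmed` stored as `long` and not indexed might look roughly like the sketch below. Only the `pubmed` entry reflects what was agreed here. The function name is hypothetical, and the types chosen for the other subfields are assumptions, not the plugin's actual mapping.

```python
def hgnc_genegroup_mapping():
    """Return a sketch of an Elasticsearch mapping for the hgnc_genegroup field."""
    return {
        "hgnc_genegroup": {
            "properties": {
                # Assumed types; the real plugin may use different ones.
                "id": {"type": "keyword"},
                "abbr": {"type": "keyword"},
                "name": {"type": "text"},
                "comments": {"type": "text"},
                "typical_gene": {"type": "keyword"},
                # Agreed in this thread: long, and not indexed for now.
                "pubmed": {"type": "long", "index": False},
            }
        }
    }


mapping = hgnc_genegroup_mapping()
```

With `"index": False` the values are still returned in `_source`, but the field cannot be searched; flipping it later would require reindexing.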