Annotate genes in HGNC with SO terms

Open cthoyt opened this issue 3 years ago • 4 comments

Either find or create terms for all HGNC gene locus types that can be used to annotate all genes in HGNC:

  • [x] Fragile site https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/301
  • [x] RNA, Y (see SO:0000405 which is for transcripts) https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/550
  • [ ] RNA, cluster (https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/564)
  • [x] RNA, vault (see SO:0000404 which is for transcripts) https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/552
  • [ ] phenotype only (HGNC uses SO:0001500 but this is not under gene)
  • [x] complex locus constituent (29): these genes encode proteins that are part of complexes. I would suggest that HGNC retire this type in favor of a standard protein-coding gene annotation
  • [x] protocadherin (https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/562)
  • [x] region (38): these aren't even necessarily genes, and each of the 38 could use more careful annotation. In the meantime, they get SO:0001411 (biological region)
  • [ ] readthrough (I used SO:0000697 but this might not be right)
  • [ ] virus integration site (8) https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/551

So far, the mappings I've made are in here: https://github.com/pyobo/pyobo/blob/dc7b4736f2bbf943084e8f8a95e1293c2717c566/src/pyobo/sources/hgnc.py#L110-L145
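
As a rough illustration of the shape such a mapping could take (this is a minimal sketch, not the actual contents of `hgnc.py`), the unresolved locus types from the checklist above can be kept in a dictionary with `None` marking the ones still waiting on a SO term. The names `LOCUS_TYPE_TO_SO`, `get_so_term`, and `unmapped` are hypothetical, the locus type strings are copied from the checklist rather than from HGNC's data, and only SO identifiers mentioned in this issue are used.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical mapping from HGNC locus type to SO term.
# None means no suitable SO term has been agreed on yet.
LOCUS_TYPE_TO_SO: Dict[str, Optional[str]] = {
    "region": "SO:0001411",  # biological region, used as a placeholder
    "phenotype only": "SO:0001500",  # used by HGNC, but not under gene
    "readthrough": "SO:0000697",  # tentative, might not be right
    "RNA, cluster": None,  # waiting on SO-Ontologies issue 564
    "virus integration site": None,  # waiting on SO-Ontologies issue 551
}


def get_so_term(locus_type: str) -> Optional[str]:
    """Return the SO term for a locus type, or None if there is no mapping."""
    return LOCUS_TYPE_TO_SO.get(locus_type)


def unmapped(locus_types: Iterable[str]) -> List[str]:
    """Return the distinct locus types that still lack a SO term, sorted."""
    return sorted(lt for lt in set(locus_types) if get_so_term(lt) is None)
```

Running `unmapped` over the locus type of every HGNC record would give a quick list of what still needs a term, including any locus type that is missing from the dictionary entirely.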

Related discussion

With HGNC on Twitter:

On the OBO Foundry Slack workspace:

https://obo-communitygroup.slack.com/archives/C01BDKWDS91/p1631787773022200

cthoyt, Sep 16 '21 14:09

CC @sartweedie

cthoyt, May 26 '23 13:05

Just to clarify the situation with ‘complex locus constituent’ - this isn’t for genes that encode proteins that are part of complexes but rather complex in the sense of complicated. These are unusual cases where the research community have requested names for parts of complicated loci encoding many alternate isoforms. We think the closest SO term is gene_fragment (SO:0000997).

sartweedie, May 26 '23 14:05

Readthroughs are another oddity - they really represent transcripts derived from more than one adjacent gene. However, they are often discussed and treated as separate ‘genes’, distinct from the component genes that contribute to the ‘readthrough’, so some have been named separately. SO:0000697 doesn’t work for these. We suggest making a new SO term for these under transcript. I can put in a ticket for this.

sartweedie, May 26 '23 15:05

SO:0001500 is fine for phenotype, I think, even though it isn't under gene. All of the HGNC phenotype records have been withdrawn (though they still appear in our records as withdrawn).

sartweedie, May 26 '23 15:05