Dataprovider - get canonical / MANE transcript

Open davmlaw opened this issue 1 year ago • 5 comments

There is often a need to pick a single transcript for a gene. This is often referred to as the canonical transcript, and nowadays (clinically) it is usually the MANE transcript.

It would be good to add a method to the data provider to retrieve the MANE transcript for a gene name.

Transcripts could also have a field/flag on them saying whether they are MANE transcripts

Once we have the data provider API done, implementations like UTA or cdot could implement it

This is necessary to implement #517

Also came up as a request for #743

davmlaw avatar Aug 05 '24 11:08 davmlaw

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Nov 04 '24 02:11 github-actions[bot]

The main question here is how to handle the UTA implementation. It doesn't look like a modified UTA will be available any time soon

I could make a shim around calls to UTA and implement the MANE lookup by loading MANE.GRCh38.v1.4.summary.txt.gz (about 1.1 MB)

So you could either supply that file, or get a NotImplementedError raised
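A minimal sketch of what such a shim might look like. The names `ManeShim`, `load_mane_summary` and `get_mane_tx_for_gene` are hypothetical and not part of hgvs, and the column names (`symbol`, `RefSeq_nuc`, `MANE_status`) are assumed from the layout of the NCBI MANE summary file, so check them against the file you actually load:

```python
import csv
import gzip


def load_mane_summary(path):
    """Map gene symbol -> RefSeq accession of its MANE Select transcript."""
    mane_by_gene = {}
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # Assumed column names; "MANE Plus Clinical" rows are skipped.
            if row["MANE_status"] == "MANE Select":
                mane_by_gene[row["symbol"]] = row["RefSeq_nuc"]
    return mane_by_gene


class ManeShim:
    """Hypothetical wrapper: forwards to a data provider and adds a MANE lookup."""

    def __init__(self, data_provider, mane_summary_path=None):
        self._dp = data_provider
        self._mane = (
            load_mane_summary(mane_summary_path) if mane_summary_path else None
        )

    def __getattr__(self, name):
        # Only called for attributes not defined on the shim itself,
        # so every existing provider method keeps working unchanged.
        return getattr(self._dp, name)

    def get_mane_tx_for_gene(self, gene):
        if self._mane is None:
            raise NotImplementedError("No MANE summary file was supplied")
        # Returns None when the gene has no MANE Select entry.
        return self._mane.get(gene)
```

With this shape, a caller that never supplies the file sees no change in behaviour until it asks for a MANE transcript.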

I will have a crack at this when I get home (currently at a conference)

davmlaw avatar Nov 07 '24 02:11 davmlaw

This could be implemented somewhere around the get_tx_for_region method, e.g. as a second filter on top of the initial response (only return tx_acs that are part of MANE transcripts).
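As a rough illustration of that second filter, assuming each row in the get_tx_for_region response can be indexed by a `tx_ac` key and that a set of MANE accessions has already been loaded from somewhere. `filter_to_mane` is a hypothetical helper, and the accessions below are placeholders:

```python
def filter_to_mane(tx_rows, mane_tx_acs):
    """Keep only the rows whose tx_ac is a known MANE transcript accession."""
    return [row for row in tx_rows if row["tx_ac"] in mane_tx_acs]


# Plain dicts stand in for the provider's rows (placeholder accessions).
rows = [
    {"tx_ac": "NM_000001.1", "alt_ac": "NC_000001.11"},
    {"tx_ac": "NM_000002.1", "alt_ac": "NC_000001.11"},
]
mane_only = filter_to_mane(rows, {"NM_000001.1"})
```

Matching here is on the full versioned accession, so a MANE release that points at a newer transcript version than the provider holds would filter everything out.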

andreasprlic avatar Nov 17 '24 16:11 andreasprlic

First, we need a canonical definition of canonical....

"canonical" isn't a property of a transcript but a label/choice, made differently by each annotation provider (RefSeq/Ensembl), genome build, and release (MANE version)

Eg using GFF/GTF tags:

| Annotation | Build  | GFF Tags                |
|------------|--------|-------------------------|
| RefSeq     | GRCh37 | RefSeq select           |
| RefSeq     | GRCh38 | RefSeq select, MANE     |
| Ensembl    | GRCh37 | n/a - see notes below   |
| Ensembl    | GRCh38 | Ensembl_canonical, MANE |

Ensembl GRCh37 - "canonical" is exposed in the Ensembl REST API

So given the choice of:

Pick canonical server-side

  • Make a new API request for canonical
  • Add a new field {"canonical": True}

Pick canonical client-side

  • Dataprovider get_tx_for_region and get_tx_for_gene returns transcript "tags" (eg "Ensembl_canonical, MANE")
  • The client consumes the transcript/tags and then uses a CanonicalPicker class to decide which one is canonical

I lean towards the client-side as:

  • Client decides on details of how to pick canonical transcript
  • We don't have to wait for UTA - we can immediately implement LongestTranscriptCanonicalPicker or make a local version that loads MANE text file and uses that plus transcripts to pick
  • We can make a client picker that reads tags, selects those containing e.g. MANE, Ensembl or RefSeq select, sorts them, then returns the highest-ranked transcript per desired contig. This will be ready to go once UTA has tags in it.
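A sketch of how the client-side pickers might look. Everything here is hypothetical: the `Transcript` record, the `pick` method and the tag priority order are assumptions for illustration, since data providers do not return tags today. For brevity it matches tags exactly and leaves out the per-contig grouping:

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # Hypothetical record; real provider responses may look different.
    tx_ac: str
    length: int
    tags: list = field(default_factory=list)


class LongestTranscriptCanonicalPicker:
    """Needs no tags at all, so it can work with any provider."""

    def pick(self, transcripts):
        return max(transcripts, key=lambda tx: tx.length, default=None)


class TagCanonicalPicker:
    """Picks the transcript carrying the highest-priority tag."""

    def __init__(self, priority=("MANE", "RefSeq select", "Ensembl_canonical")):
        # Earlier entries win; this order is an assumption, not a standard.
        self.priority = priority

    def _rank(self, tx):
        ranks = [i for i, tag in enumerate(self.priority) if tag in tx.tags]
        return min(ranks) if ranks else None

    def pick(self, transcripts):
        ranked = [(self._rank(tx), tx) for tx in transcripts]
        ranked = [(rank, tx) for rank, tx in ranked if rank is not None]
        if not ranked:
            return None  # nothing tagged; caller may fall back to another picker
        return min(ranked, key=lambda pair: pair[0])[1]
```

Keeping the priority list as a constructor argument leaves the definition of "canonical" with the client, which is the main argument for the client-side option.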

I guess a question is whether UTA should have e.g. multiple versions of MANE in it. If so, we'd need to pass the MANE version somewhere in the API. Or you could just return tags + versions, e.g. ["MANE:v1.3", "MANE:v1.4"], and do the picking in the client again
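If tags did carry versions in that "NAME:vX.Y" form, the client-side handling could be as small as the sketch below. The tag format is taken from the example above; the helper names are hypothetical:

```python
def parse_tag(tag):
    """Split "MANE:v1.4" into ("MANE", (1, 4)); unversioned tags get None."""
    name, sep, version = tag.partition(":")
    if not sep:
        return name, None
    # Compare versions as integer tuples so that v1.10 sorts after v1.4.
    return name, tuple(int(part) for part in version.lstrip("v").split("."))


def has_mane_version(tags, wanted):
    """True if the tags include MANE at exactly the wanted version tuple."""
    return any(parse_tag(tag) == ("MANE", wanted) for tag in tags)


def latest_mane_version(tags):
    """Highest MANE version among the tags, or None if there is none."""
    versions = [
        version
        for name, version in map(parse_tag, tags)
        if name == "MANE" and version is not None
    ]
    return max(versions, default=None)
```

A picker could then take the desired MANE version as an option, or default to the latest one present, without any change to the server API.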

davmlaw avatar Jul 29 '25 04:07 davmlaw

I'm going to remove this from the hgvs 2.0 milestone. I agree that biocommons tools should make it easy to identify a MANE transcript, but I'm a bit skeptical that this should be in the hgvs package itself.

reece avatar Oct 15 '25 16:10 reece