Dataprovider - get canonical / MANE transcript
There is often a need to pick one transcript for a gene. This is often referred to as the canonical transcript, and nowadays (clinically) it is usually the MANE transcript.
It would be good to add a method to the data provider to retrieve the MANE transcript from a gene name.
Transcripts could also have a field/flag on them saying whether they are MANE transcripts.
Once we have the data provider API done, implementations like UTA or cdot could implement it.
This is necessary to implement #517
Also came up as a request for #743
The main question here is how to handle the UTA implementation. It doesn't look like a modification to UTA will be available any time soon.
I could make a shim around calls to UTA and implement the MANE lookup by loading MANE.GRCh38.v1.4.summary.txt.gz (1.1Mb).
So you could either supply that file, or get a NotImplementedError thrown.
I will have a crack at this when I get home (currently at a conference)
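
A minimal sketch of the shim idea above, not an actual hgvs API. `ManeShim`, `load_mane_select` and `get_mane_transcript` are hypothetical names, and the column names (`symbol`, `RefSeq_nuc`, `MANE_status`) are my assumption about the layout of the MANE summary file, so check them against the file you download.

```python
import csv
import gzip


def load_mane_select(fh):
    """Map gene symbol -> RefSeq transcript accession for MANE Select rows.

    Assumes a tab-separated file with a header row containing the columns
    'symbol', 'RefSeq_nuc' and 'MANE_status' (an assumption, see above).
    """
    reader = csv.DictReader(fh, delimiter="\t")
    mane = {}
    for row in reader:
        if row["MANE_status"] == "MANE Select":
            mane[row["symbol"]] = row["RefSeq_nuc"]
    return mane


class ManeShim:
    """Hypothetical wrapper around an existing data provider (eg UTA)."""

    def __init__(self, hdp, mane_summary_path=None):
        self._hdp = hdp
        self._mane = None
        if mane_summary_path is not None:
            # gzip.open in text mode decompresses and decodes on the fly
            with gzip.open(mane_summary_path, "rt", newline="") as fh:
                self._mane = load_mane_select(fh)

    def __getattr__(self, name):
        # Only called for attributes not found on the shim itself,
        # so every other call is passed through to the wrapped provider.
        return getattr(self._hdp, name)

    def get_mane_transcript(self, gene_symbol):
        if self._mane is None:
            raise NotImplementedError(
                "No MANE summary file supplied; cannot look up MANE transcripts"
            )
        # Returns None when the gene has no MANE Select entry
        return self._mane.get(gene_symbol)
```

Existing code keeps calling the wrapped provider as before; only the MANE lookup depends on whether the file was supplied.
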
This could be implemented somewhere around the get_tx_for_region method, eg as a second filter on top of the initial response (only return tx_acs that are part of MANE transcripts).
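
A rough sketch of that second filter. It assumes each row returned by get_tx_for_region can be indexed by a "tx_ac" key, and that the set of MANE accessions comes from somewhere else (eg the MANE summary file). `filter_mane_transcripts` is a hypothetical helper, not an existing hgvs function.

```python
def filter_mane_transcripts(tx_rows, mane_tx_acs):
    """Keep only rows whose transcript accession is a MANE transcript.

    tx_rows: iterable of dict-like rows with a 'tx_ac' key (assumed shape)
    mane_tx_acs: collection of versioned accessions, eg loaded from a file
    """
    mane_tx_acs = set(mane_tx_acs)
    return [row for row in tx_rows if row["tx_ac"] in mane_tx_acs]
```

Note this matches on the full versioned accession, so a transcript whose version differs from the one in the MANE release would be filtered out.
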
First, we need a canonical definition of canonical....
"Canonical" isn't a property of a transcript but a label/choice, made differently by each annotation provider (RefSeq/Ensembl), genome build, and release (MANE version).
Eg using GFF/GTF tags:
| Annotation | Build | GFF Tags |
|---|---|---|
| RefSeq | GRCh37 | RefSeq select |
| RefSeq | GRCh38 | RefSeq select, MANE |
| Ensembl | GRCh37 | n/a - see notes below |
| Ensembl | GRCh38 | Ensembl_canonical, MANE |
Ensembl GRCh37 - "canonical" is exposed in the Ensembl REST API
So given the choice of:

Pick canonical server-side
- Make a new API request for canonical
- Add a new field `{"canonical": True}`

Pick canonical client-side
- Dataprovider `get_tx_for_region` and `get_tx_for_gene` return transcript "tags" (eg "Ensembl_canonical, MANE")
- The client consumes the transcript/tags and then uses a `CanonicalPicker` class to decide which one is canonical

I lean towards the client-side as:
- Client decides on details of how to pick canonical transcript
- We don't have to wait for UTA - we can immediately implement LongestTranscriptCanonicalPicker or make a local version that loads MANE text file and uses that plus transcripts to pick
- We can make a client picker that reads tags, keeps transcripts tagged with eg MANE, Ensembl or RefSeq select, sorts them, then returns the highest-ranked one per desired contig. This will be ready to go once UTA has tags in it.
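
To make the client-side option concrete, here is a sketch of what the pickers could look like. Everything here is hypothetical: the `Transcript` record, the `pick` method, `TagCanonicalPicker` and the default tag priority are placeholders for illustration, not a proposed final API. The per-contig step is left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal record of what the data provider would return."""
    tx_ac: str
    length: int
    tags: frozenset = frozenset()


class CanonicalPicker:
    def pick(self, transcripts):
        """Return one transcript, or None if nothing qualifies."""
        raise NotImplementedError


class LongestTranscriptCanonicalPicker(CanonicalPicker):
    """Works today: needs no tags, only transcript lengths."""

    def pick(self, transcripts):
        if not transcripts:
            return None
        # Tie-break on accession so the result is deterministic
        return max(transcripts, key=lambda t: (t.length, t.tx_ac))


class TagCanonicalPicker(CanonicalPicker):
    """Picks by tag priority; earlier tags in tag_priority win."""

    def __init__(self, tag_priority=("MANE", "RefSeq select", "Ensembl_canonical")):
        self.tag_priority = tuple(tag_priority)

    def _rank(self, tx):
        for rank, tag in enumerate(self.tag_priority):
            if tag in tx.tags:
                return rank
        return None  # transcript carries none of the wanted tags

    def pick(self, transcripts):
        ranked = [(self._rank(t), t) for t in transcripts]
        ranked = [(r, t) for r, t in ranked if r is not None]
        if not ranked:
            return None
        return min(ranked, key=lambda rt: (rt[0], rt[1].tx_ac))[1]
```

The point of the split is that the data provider only has to return facts (lengths, tags), and each client can swap in whichever picker matches its own definition of canonical.
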
I guess a question is whether UTA should have eg multiple versions of MANE in it. In that case, we'd need to pass in the MANE version somewhere in the API. Or you could just return tags + versions, eg ["MANE:v1.3", "MANE:v1.4"], and do the picking in the client again.
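
If tags did carry versions like that, the client-side matching could be as small as the sketch below. The "name:version" format is just the one floated above, and `parse_tag` / `has_tag` are hypothetical helpers.

```python
def parse_tag(tag):
    """Split 'MANE:v1.4' into ('MANE', 'v1.4'); unversioned tags get None."""
    name, sep, version = tag.partition(":")
    return name, (version if sep else None)


def has_tag(tags, name, version=None):
    """True if tags contain `name`, optionally restricted to one version."""
    for tag in tags:
        tag_name, tag_version = parse_tag(tag)
        if tag_name == name and (version is None or tag_version == version):
            return True
    return False
```

A client that cares about a specific release would call `has_tag(tags, "MANE", "v1.4")`, while one that accepts any release just omits the version.
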
I'm going to remove this from the hgvs 2.0 milestone. I agree that biocommons tools should make it easy to identify a MANE transcript, but I'm a bit skeptical that this should be in the hgvs package itself.