allow using canonical transcript instead of highest impact

Open brentp opened this issue 9 years ago • 16 comments

see: https://groups.google.com/forum/#!topic/gemini-variation/KKCO05-RNYo

brentp avatar Feb 02 '16 13:02 brentp

Hey y'all,

What do you think we'd have to do to implement this? I've had some folks we work with asking about it.

roryk avatar Nov 06 '17 19:11 roryk

I like Brent's idea of '--use-canonical' flag for those who want ranking based on canonical transcripts. The default can still remain the same.

udp3f avatar Nov 06 '17 19:11 udp3f

Is the canonical transcript flagged in the snpEFF/VEP output and folded into the INFO field somehow?

roryk avatar Nov 06 '17 19:11 roryk

Yes there's a '--canonical' flag in VEP and that should be available in INFO. Not sure about snpEff though.

udp3f avatar Nov 06 '17 20:11 udp3f

As long as there is a way to know which transcript is the canonical one (is it put first?), this should be implemented by adding a flag to gemini and then passing it to the geneimpacts module, which does the variant prioritization.
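
As a rough sketch of what such an opt-in flag could look like: the parser, the `choose_impact` helper, the `is_canonical` and `severity` keys, and the severity ranking below are all hypothetical and are not the actual gemini or geneimpacts API. When no transcript is marked canonical, the sketch falls back to the most severe impact so that the variant is not dropped.

```python
import argparse

# Hypothetical severity ranking; higher means more severe.
SEVERITY_RANK = {"LOW": 0, "MED": 1, "HIGH": 2}


def build_parser():
    """Parser with the proposed opt-in flag; the default behavior is unchanged."""
    parser = argparse.ArgumentParser(prog="load-sketch")
    parser.add_argument(
        "--use-canonical",
        action="store_true",
        help="prefer impacts on the canonical transcript over the most severe one",
    )
    return parser


def choose_impact(impacts, use_canonical=False):
    """Return one impact dict per variant (None if the list is empty).

    Default: the most severe impact across all transcripts.
    With use_canonical: the most severe impact among canonical transcripts,
    falling back to all transcripts when none is flagged canonical.
    """
    if not impacts:
        return None
    candidates = impacts
    if use_canonical:
        canonical = [i for i in impacts if i.get("is_canonical")]
        if canonical:
            candidates = canonical
    return max(candidates, key=lambda i: SEVERITY_RANK[i["severity"]])


# Toy data: a severe impact on a non-canonical transcript,
# and a mild one on the canonical transcript.
impacts = [
    {"transcript": "T1", "severity": "HIGH", "is_canonical": False},
    {"transcript": "T2", "severity": "LOW", "is_canonical": True},
]
args = build_parser().parse_args(["--use-canonical"])
chosen = choose_impact(impacts, use_canonical=args.use_canonical)
```

With the flag set, `chosen` is the impact on `T2`; without it, the same call returns the impact on `T1`.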

brentp avatar Nov 06 '17 20:11 brentp

++ particularly on being able to make hgvs use the canonical transcript

jxchong avatar Nov 14 '17 06:11 jxchong

Is there any kind of workaround we could implement before this feature gets added? Perhaps some way to get the VEP --pick (https://useast.ensembl.org/info/docs/tools/vep/script/vep_options.html) information added to variant_impact?

davemcg avatar Mar 13 '18 21:03 davemcg

@brentp I've run into this issue as well and agree this would be nice to fix. As mentioned, VEP has a "canonical" field which is in the INFO column (and gets imported into gemini as vep_canonical). For SnpEff, they have a -canon option which only annotates the canonical transcript (which seems similar to the --pick option in VEP), but I'm not sure that's what users will want to do.

Instead, I wonder how difficult it would be to determine which is the canonical transcript on the fly in the geneimpacts module. Here is the simple rule that SnpEff uses to determine canonical:

"Canonical transcripts are defined as the longest CDS of amongst the protein coding transcripts in a gene. If none of the transcripts in a gene is protein coding, then it is the longest cDNA. "

Here is the rule that Ensembl uses, which is spelled out in a more complicated way but is, I believe, essentially the same:

"The canonical transcript is used in the gene tree analysis in Ensembl and does not necessarily reflect the most biologically relevant transcript of a gene. For human, the canonical transcript for a gene is set according to the following hierarchy: 1. Longest CCDS translation with no stop codons. 2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons. 3. If no (2), choose the longest translation with no stop codons. 4. If no translation, choose the longest non-protein-coding transcript."

Basically you could look at the Protein_position column from the CSQ field (VEP) or the Amino_Acid_length column from the EFF field (SnpEff) to get the length of the CDS for the transcripts and sort from highest to lowest. I suppose you would need something to use as a tie-breaker -- maybe highest number of exons, then longest transcript, then pick random?
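
A minimal sketch of that ranking, assuming each transcript has already been parsed into a dict: the keys `transcript_id`, `biotype`, `cds_length`, `n_exons`, and `transcript_length` are hypothetical names, not fields produced by VEP, SnpEff, or geneimpacts. It follows the SnpEff rule quoted above (longest CDS among protein-coding transcripts, otherwise the longest cDNA) with the suggested tie-breakers. One deliberate swap: the final tie-breaker is the transcript ID rather than a random pick, so that repeated runs give the same answer.

```python
from typing import Dict, List, Optional


def pick_canonical(transcripts: List[Dict]) -> Optional[Dict]:
    """Pick one transcript per gene using a longest-CDS rule with tie-breakers."""
    if not transcripts:
        return None

    # Prefer protein-coding transcripts; use all transcripts only if there are none.
    coding = [t for t in transcripts if t["biotype"] == "protein_coding"]
    has_coding = bool(coding)
    pool = coding if has_coding else transcripts

    def sort_key(t: Dict):
        # Primary criterion: CDS length for coding genes, cDNA length otherwise.
        primary = t["cds_length"] if has_coding else t["transcript_length"]
        # Negated so that min() picks the largest value; the ID breaks remaining ties.
        return (-primary, -t["n_exons"], -t["transcript_length"], t["transcript_id"])

    return min(pool, key=sort_key)


# Toy data: T1 and T2 tie on CDS length, T2 has more exons.
transcripts = [
    {"transcript_id": "T1", "biotype": "protein_coding", "cds_length": 300,
     "n_exons": 5, "transcript_length": 2000},
    {"transcript_id": "T2", "biotype": "protein_coding", "cds_length": 300,
     "n_exons": 7, "transcript_length": 1800},
    {"transcript_id": "T3", "biotype": "lincRNA", "cds_length": 0,
     "n_exons": 9, "transcript_length": 5000},
]
best = pick_canonical(transcripts)
```

Here `best` is `T2`: the non-coding `T3` is excluded because coding transcripts exist, and the exon count breaks the CDS-length tie.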

oleraj avatar Nov 07 '18 20:11 oleraj

Hi @oleraj I'd gladly accept a PR for this.

brentp avatar Nov 07 '18 20:11 brentp

After doing some more digging, it looks like this was maybe already fixed by @roryk in geneimpacts (at least for VEP-annotated VCFs), though I haven't tested it yet. https://github.com/brentp/geneimpacts/commit/c1fd841c65a65e83be7fe1d1304785bb6db0642d#diff-ef46603b09e1d94334dfde203c2a72db This is in at least version 0.3.4 for geneimpacts; can we update this in the requirements file for gemini? Maybe geneimpacts>=0.3.6? Currently it's using 0.1.3.

oleraj avatar Nov 07 '18 22:11 oleraj

@brentp Once geneimpacts is updated in the newest release, does that mean GEMINI will default to loading the canonical impact and not the most severe? cc @oleraj

jxchong avatar Jan 16 '19 20:01 jxchong

@jxchong No. Someone would have to implement an option (in gemini) to allow it to choose canonical transcripts.

brentp avatar Jan 16 '19 20:01 brentp

Hmm, I am not sure using the canonical impact is the best approach in the context of rare disease. I would rather manually refute candidates than miss them because they are on a different transcript.

arq5x avatar Jan 16 '19 20:01 arq5x

@arq5x Yes, it's a mixed bag (which is why I asked). For discovery, I'd want the most severe impact, but for reporting out, I might prefer the canonical transcript (it's not uncommon for us to pull out a severe impact in an incomplete/suspect/unsupported transcript and a synonymous change in all other transcripts).

jxchong avatar Jan 16 '19 20:01 jxchong

Agreed, I don't really like using the canonical approach because you miss impactful variants. We missed a P53 variant because it was on a non-canonical transcript recently. But the canonical setting is useful when talking to other folks. I've found that clinicians know mutations by the amino acid change in the canonical transcript, so to be on the same page, we have to be talking about the same thing.

roryk avatar Jan 16 '19 20:01 roryk

Just a reminder that the canonical transcript should be available in the variant_impacts table.
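
For reporting, one possible workaround is to look up the canonical-transcript row for a variant after loading. The sketch below runs against a toy in-memory SQLite table, not a real GEMINI database. It assumes a `vep_canonical` column (the name mentioned earlier in this thread) holding `'YES'` for canonical transcripts; the other columns and all values are made up for illustration, so check the schema of your own database first.

```python
import sqlite3

# Toy stand-in for a GEMINI database; a real one would be opened from its file.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE variant_impacts (
        variant_id INTEGER,
        gene TEXT,
        transcript TEXT,
        impact TEXT,
        aa_change TEXT,
        vep_canonical TEXT  -- assumed column name and 'YES' convention
    )
    """
)
conn.executemany(
    "INSERT INTO variant_impacts VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "GENE1", "T1", "stop_gained", "Q10*", None),
        (1, "GENE1", "T2", "synonymous_variant", "L25L", "YES"),
    ],
)


def canonical_impacts(conn, variant_id):
    """Return (transcript, impact, aa_change) rows on canonical transcripts."""
    query = (
        "SELECT transcript, impact, aa_change "
        "FROM variant_impacts "
        "WHERE variant_id = ? AND vep_canonical = 'YES'"
    )
    return conn.execute(query, (variant_id,)).fetchall()


rows = canonical_impacts(conn, 1)
```

In this toy table the most severe impact for variant 1 is on `T1`, while `rows` contains only the `T2` row, which is the one to use when reporting the amino acid change.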

brentp avatar Jan 16 '19 20:01 brentp