
Consider handling IVS coordinates

Open reece opened this issue 9 years ago • 8 comments

Originally reported by: Reece Hart (Bitbucket: reece, GitHub: reece)


Originally from an email to hgvs-discuss:

Previously, the HGVS recommendation was to use either c.88+2T>G / c.89-1G>T or c.IVS2+2T>G / c.IVS2-1G>T. They dropped the latter (http://www.hgvs.org/mutnomen/refseq.html#IVS); however, there are still a lot of articles using it. Is there a way to convert from c.IVS2+2T>G to c.88+2T>G (or genomic coordinates)? To make things worse, the articles usually only provide the gene name, not the transcript.


  • Bitbucket: https://bitbucket.org/biocommons/hgvs/issue/267

reece avatar Sep 02 '15 23:09 reece

I am interested in how to handle IVS variant strings too, as I want to convert variants from older papers in bulk to genomic coordinates. What is the "trivial" solution for handling these?

hannes-brt avatar Sep 20 '18 20:09 hannes-brt

IVS syntax is not part of HGVS recommendations. So, the only use case for IVS is to translate old variants into base-offset positions (e.g., c.123-4). I don't think we'd ever support generating IVS variants.
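
As an illustration of that translation, here is a minimal sketch that rewrites a simple IVS substitution as a base-offset position, assuming the transcript and its exon boundaries are already known. The `EXON_LAST_C` table and the function name are hypothetical and not part of the hgvs package; the boundary values are chosen only so that exon 2 ends at c.88, as in the example from the original email.

```python
import re

# Hypothetical table for ONE known transcript: exon number -> c. position of
# the last base of that exon. Values are illustrative, not real annotation.
EXON_LAST_C = {1: 33, 2: 88, 3: 160}

# Matches simple substitutions such as "c.IVS2+2T>G" or "IVS2-1G>T".
IVS_RE = re.compile(r"^(?:c\.)?IVS(\d+)([+-])(\d+)([ACGT])>([ACGT])$")


def ivs_to_base_offset(ivs, exon_last_c):
    """Rewrite an IVS substitution as a c. base-offset position."""
    m = IVS_RE.match(ivs)
    if m is None:
        raise ValueError(f"unsupported IVS string: {ivs!r}")
    intron, sign, offset, ref, alt = m.groups()
    last = exon_last_c[int(intron)]  # KeyError if the intron is unknown
    if last < 1:
        # Assumption: only introns flanked by positive CDS positions are
        # handled; UTR numbering (c.-N, c.*N) needs extra care.
        raise ValueError("only CDS-internal introns are handled in this sketch")
    # "+" offsets count from the last base of the upstream exon;
    # "-" offsets count from the first base of the downstream exon.
    anchor = last if sign == "+" else last + 1
    return f"c.{anchor}{sign}{offset}{ref}>{alt}"


print(ivs_to_base_offset("c.IVS2+2T>G", EXON_LAST_C))
print(ivs_to_base_offset("c.IVS2-1G>T", EXON_LAST_C))
```

The arithmetic is the easy part. The sketch says nothing about how to choose the transcript or verify its exon structure, which is the hard problem discussed in this thread.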

@hannes-brt and others: Please attach IVS examples that you have. If we get enough requests, we'll reconsider whether we can support IVS variants.

This problem is hard because IVS variants are irregularly formatted and because we may not be able to find transcript definitions for older transcripts. And we definitely won't find genome alignments for those older transcripts.

reece avatar Sep 20 '18 21:09 reece

Many thanks @reece. Yes, obviously the only use case for this is converting older sources, but I find that a lot of these are still around.

For example, the DBASS databases (http://www.dbass.org.uk), but also this recent review paper on deep intronic variants: https://rdcu.be/7nT9.

hannes-brt avatar Sep 21 '18 19:09 hannes-brt

IVS (and EX) variants imply an exon structure, but most such variants don't specify a transcript. Therefore, such variants are roughly like street addresses without the street name. The reason this is an unsolved issue is that I've never figured out how to reliably infer the transcript record from IVS/EX variants.

Take IVS1+1G>A in HMBS (from http://www.dbass.org.uk/DBASS5/viewsplicesite.aspx?id=180). Without a transcript specified, the exon structure is undefined. To make matters worse, the link to the gene on that page is defunct, which means we can't even guess at plausible transcripts. As far as I'm concerned, that makes the data nearly unusable.

I don't have access to the manuscript you linked. If you're saying that it contains IVS or EX variants, I'm incredulous: No journal or reviewer should have accepted a paper in 2017 that contains variants that are not in HGVS format.

I realize that I'm expressing strong opinions. However, sharing variation data must be made precise or else the anticipated inferences for human biology will be lost and, possibly, people will be clinically harmed.

reece avatar Sep 21 '18 20:09 reece

You are completely right about the issue of missing transcript reference. Of course, some papers were written pre-Human Genome Project and the transcripts they were talking about were not even fully known, much less all possible isoforms of the gene.

I can only brainstorm about ways to infer the correct transcript from these variants: it seems one would need to start from a small number of canonical RefSeq transcripts and then check that the reference allele matches. Hopefully, this would work with reasonably high confidence for the majority of variants.

Sorry about the journal link, it was supposed to be a sharable link without a paywall. In any case, here's a PDF link: https://drive.google.com/file/d/1ZTSD2FYAPZuSKyEyTw-4j5WHZOis6Uvt/view?usp=sharing

This is a review paper, summarizing over a hundred known variants, but they are all in the format in which they were originally reported.

I fully agree with all your points; however, given how many of these resources still exist, it would be great to have a way forward to making these variants available in modern pipelines.

hannes-brt avatar Sep 21 '18 21:09 hannes-brt

I'm game for brainstorming! I'd like to solve this, but only if it's reliable.

I already have a script to kinda do what you suggested in hgvs-guess-plausible-transcripts. It works like this:

(3.6) snafu$ ./misc/experimental/hgvs-guess-plausible-transcripts 'HFE2:c.187_188insGAG' 'TNFRSF1A:c.123T>C' 'TNFRSF1A:n.426T>C'
HFE2:c.187_188insGAG	5	NM_213653.3:c.187_188insGAG	NM_202004.3:c.187_188insGAG	NM_145277.4:c.187_188insGAG	NM_001316767.1:c.187_188insGAG	NM_213652.3:c.187_188insGAG
TNFRSF1A:c.123T>C	1	NM_001065.3:c.123T>C
TNFRSF1A:n.426T>C	1	NR_144351.1:n.426T>C

For each quasi-variant on the command line, the script constructs the variant on all of the transcripts for the named gene. If the variant is considered valid (in the hgvs validator sense), then it's displayed. Columns above are input variant, # of results, list of results (all tab sep).
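
The filtering idea described above can be sketched in a self-contained way. This is not the actual script: the toy `TRANSCRIPTS` table, the made-up accessions `TX_A.1` and `TX_B.1`, and the simple reference-base check stand in for the real transcript lookup and the hgvs validator, and only c. substitutions are handled.

```python
import re

# Hypothetical toy data: gene symbol -> {made-up accession: coding sequence}.
TRANSCRIPTS = {
    "GENE1": {
        "TX_A.1": "ATGGCTTGA",
        "TX_B.1": "ATGGCATGA",
    },
}

# Matches quasi-variants such as "GENE1:c.6T>G" (substitutions only).
SUB_RE = re.compile(r"^(\w+):c\.(\d+)([ACGT])>([ACGT])$")


def plausible_transcripts(quasi_variant, transcripts=TRANSCRIPTS):
    """Return the variant rewritten on every transcript where it 'validates'."""
    m = SUB_RE.match(quasi_variant)
    if m is None:
        raise ValueError(f"unsupported quasi-variant: {quasi_variant!r}")
    gene, ref, alt = m.group(1), m.group(3), m.group(4)
    pos = int(m.group(2))
    hits = []
    for accession, cds in sorted(transcripts.get(gene, {}).items()):
        # Stand-in for validation: the stated reference base must match
        # the transcript sequence at that position.
        if 1 <= pos <= len(cds) and cds[pos - 1] == ref:
            hits.append(f"{accession}:c.{pos}{ref}>{alt}")
    return hits


def format_row(quasi_variant, hits):
    """Tab-separated row: input variant, number of results, results."""
    return "\t".join([quasi_variant, str(len(hits)), *hits])


for qv in ["GENE1:c.1A>G", "GENE1:c.6T>G"]:
    print(format_row(qv, plausible_transcripts(qv)))
```

A variant whose reference base matches on several transcripts stays ambiguous, just like the five results for the HFE2 example above.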

IVS is a bit harder, but doable.

reece avatar Sep 21 '18 22:09 reece

Hi Reece and Hannes,

Interesting discussion.

There are several major issues regarding support for IVS that have not been touched on yet.

  1. There is no consensus for numbering exons, so intron numbering is equally problematic. For example, some folks number exons from 1 to last according to their order in each individual transcript reference sequence rather than the order in which they appear in the gene. Therefore IVS2 for a particular transcript in a given gene may be IVS5 for a different transcript (I can find some real-life examples if required!). Consequently, a VALID variant description may match several different transcripts and project onto several different genomic positions.
  2. You would need to know the reference sequence for the intron so that variants with stated reference bases could be validated. Variants pre-GRCh37 would be a huge problem. Equally, the intronic sequences for some genes have changed between genome builds GRCh37 and GRCh38. Similarly, if a RefSeqGene or LRG was used for the IVS sequence, it may differ from both GRCh37 and GRCh38.
  3. You would also need to be certain that the selected transcript reference sequence has not been re-aligned to the selected genomic reference sequence since publication of the variant.
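
The first point in the list above can be made concrete with a toy example. In this sketch the two isoforms, their accessions, and all coordinates are invented for illustration, and exons are numbered per transcript on the forward strand.

```python
# Hypothetical genomic coordinates (forward strand) of the last base of each
# exon, listed in transcript order. All values are invented.
TRANSCRIPT_EXON_ENDS = {
    "TX_A.1": [1100, 2200, 3300, 4400],  # four-exon isoform
    "TX_B.1": [1100, 3300, 4400],        # isoform lacking the second exon
}


def ivs_donor_positions(intron_number, offset, exon_ends_by_tx):
    """Genomic position of IVS<n>+<offset> on each transcript.

    Exons are numbered per transcript, so the same intron number can refer
    to different introns of the gene.
    """
    positions = {}
    for tx, ends in exon_ends_by_tx.items():
        # The final exon has no downstream intron, hence the strict "<".
        if 1 <= intron_number < len(ends):
            positions[tx] = ends[intron_number - 1] + offset
    return positions


print(ivs_donor_positions(2, 1, TRANSCRIPT_EXON_ENDS))
```

Here "IVS2+1" is well defined on both isoforms but lands on two different genomic positions, so a gene name plus an IVS label does not pin down a single site.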

HGVS dropped the IVS nomenclature because it was too unreliable. While, in theory, it might be possible to support some IVS descriptions, I think that doing so reliably in all instances will most likely be impossible. Consequently, Reece's initial reservations about supporting IVS variants may be well founded. I also agree with Reece that a paper using the IVS nomenclature in 2017 should not have been accepted.

Cheers

Pete

Peter-J-Freeman avatar Sep 24 '18 11:09 Peter-J-Freeman

The DBASS databases are a total mess. It's useful that they have attempted to record splice sites created by sequence variants, but their variant description system is so far removed from the HGVS nomenclature that it's often impossible to understand what's meant without inspecting the entire record.

For example, take a look at the COL1A1 variant (http://www.dbass.org.uk/DBASS5/viewsplicesite.aspx?id=383). It's described as "E49+259A>G", but position 259 is not an intronic location, as you might expect. It appears to be nucleotide position 259 within an exon, and it's not exon 49 either. It turns out that the variant lies in exon 48 if conventional exon numbering is used, but in exon 49 if the legacy exon numbering system for COL1A1 is used.

leicray avatar Sep 24 '18 11:09 leicray

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Feb 28 '24 01:02 github-actions[bot]

This issue was closed because it has been stalled for 7 days with no activity.

github-actions[bot] avatar Mar 08 '24 01:03 github-actions[bot]