
Don't align transcripts with different numbers of exons

Open reece opened this issue 9 years ago • 4 comments

Originally reported by Reece Hart (Bitbucket: reece, GitHub: reece) in biocommons/uta #195 Migrated by bitbucket-issue-migration on 2016-09-09 15:15:07


UTA historically has aligned transcript and genomic exons even when the number of exons in each exon set differs. This practice masks real issues in underlying data and should be discontinued.
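As a rough illustration of the proposed behavior, the loader could refuse to pair exons when the two sets differ in size. This is only a sketch: `pair_exons` and the `(start, end)` tuple layout are hypothetical and are not taken from the UTA codebase.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # hypothetical (start, end) coordinates


def pair_exons(tx_exons: List[Exon], alt_exons: List[Exon]) -> List[Tuple[Exon, Exon]]:
    """Pair transcript exons with genomic exons, rejecting mismatched sets."""
    if len(tx_exons) != len(alt_exons):
        # Fail loudly instead of silently aligning sets of different sizes.
        raise ValueError(
            f"exon count mismatch: transcript has {len(tx_exons)}, "
            f"alignment has {len(alt_exons)}"
        )
    return list(zip(tx_exons, alt_exons))
```

Raising here surfaces the underlying data problem at load time rather than at variant-mapping time.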

reece avatar Sep 28 '15 19:09 reece

I have discovered an issue with transcript NM_001278433.1 (gene PRKAR1A), which I believe is an example of this issue. If my understanding is incorrect, please let me know.

Exon sets for the transcript:

SET search_path=uta_20180821;
SELECT * FROM exon_set WHERE tx_ac='NM_001278433.1';

267741	NM_001278433.1	AC_000149.1	1	splign	2014-02-11 01:22:19.920492
332948	NM_001278433.1	NC_000017.10	1	blat	2014-02-11 02:40:24.121284
267727	NM_001278433.1	NC_000017.10	1	splign	2014-02-11 01:22:19.920492
763376	NM_001278433.1	NC_000017.11	1	splign	2016-08-27 17:40:37.616249
267735	NM_001278433.1	NC_018928.2	1	splign	2014-02-11 01:22:19.920492
738588	NM_001278433.1	NM_001278433.1	1	transcript	2016-08-27 10:28:27.974572
88837	NM_001278433.1	NM_001278433.1	1	transcript/8ecabff0	2014-02-11 00:00:18.455632
344311	NM_001278433.1	NM_001278433.1	1	transcript/92190059	2015-08-25 22:44:41.311184

The GRCh37 splign chromosomal alignment has 10 exons:

SET search_path=uta_20180821;
SELECT * FROM exon WHERE exon_set_id='267727';

The "self" alignment has 11 exons:

SET search_path=uta_20180821;
SELECT * FROM exon WHERE exon_set_id='738588';

Judging by the exon lengths, the discrepancy is in exon 1, so g-to-c calculations using hgvs give bad results for variants along the entire transcript, not just in the first exon.
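To show why an error in the first exon affects every downstream position, here is a small sketch with made-up exon lengths (not the real PRKAR1A values). It assumes, for illustration only, that the two exon sets disagree solely in the length of exon 1; the function names are hypothetical.

```python
from itertools import accumulate
from typing import List


def exon_start_offsets(exon_lengths: List[int]) -> List[int]:
    """Return the 0-based transcript offset at which each exon begins."""
    return list(accumulate(exon_lengths, initial=0))[:-1]


def offset_shifts(lengths_a: List[int], lengths_b: List[int]) -> List[int]:
    """Per-exon difference between the start offsets of two exon sets."""
    return [a - b for a, b in zip(exon_start_offsets(lengths_a),
                                  exon_start_offsets(lengths_b))]


# Hypothetical lengths: only the first exon differs (150 vs 120 bases).
self_alignment = [150, 100, 80]
genomic_alignment = [120, 100, 80]

# Every exon after the first starts 30 bases later in one set than in the other.
print(offset_shifts(self_alignment, genomic_alignment))
```

Because transcript offsets are cumulative, a single bad exon length near the 5' end propagates to all later exons.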

My assumption was that "transcript" is the relevant self-alignment, and not "transcript/8ecabff0" or "transcript/92190059".

gostachowiak avatar Sep 07 '20 11:09 gostachowiak

First, I'm impressed that you dove this far into UTA internals!

I don't know the story for this transcript specifically, and these data are 4-6 years old, perhaps from the time before NCBI released gff files. So, this might be hard to reproduce now from sources.

When alt_aln_method contains /, it means that the UTA loader encountered a case where the definition provided by NCBI changed over time. When this happens, UTA deprecates the existing one by renaming the alignment method. (The hash after the / is a truncated md5 made by serializing the start,end coordinates and CDS start,end.)
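A minimal sketch of that renaming scheme follows. The exact serialization the loader uses is not shown here, so the payload format below (semicolon-joined `start,end` pairs followed by the CDS bounds) is an assumption, as are the function names. The 8-character truncation matches the suffixes seen above, such as `8ecabff0`.

```python
import hashlib
from typing import List, Tuple


def exon_set_digest(exons: List[Tuple[int, int]], cds_start: int, cds_end: int,
                    length: int = 8) -> str:
    """Truncated md5 over exon coordinates and CDS bounds (assumed format)."""
    # Assumed serialization; the real loader may order or delimit fields differently.
    payload = ";".join(f"{s},{e}" for s, e in exons) + f"|{cds_start},{cds_end}"
    return hashlib.md5(payload.encode("ascii")).hexdigest()[:length]


def deprecated_method_name(method: str, digest: str) -> str:
    """Build the renamed alignment method, e.g. 'transcript/8ecabff0'."""
    return f"{method}/{digest}"
```

With a scheme like this, two definitions of the same transcript with different exon or CDS coordinates get different suffixes, so both can be kept side by side.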

The presence of / nearly always means that the assembly and/or alignments are problematic. So, proceed with caution.

In uta_20190926, I see this:

anonymous@uta/uta=> set search_path  = uta_20190926 ;
anonymous@uta/uta=> select alt_ac, alt_aln_method, n_exons from tx_exon_set_summary_mv where tx_ac = 'NM_001278433.1' order by 2;
┌────────────────┬─────────────────────┬─────────┐
│     alt_ac     │   alt_aln_method    │ n_exons │
├────────────────┼─────────────────────┼─────────┤
│ NC_000017.10   │ blat                │      11 │
│ NC_018928.2    │ splign              │      10 │
│ AC_000149.1    │ splign              │      10 │
│ NC_000017.10   │ splign              │      11 │
│ NC_000017.11   │ splign              │      11 │
│ NC_000017.10   │ splign/04e3c837     │      10 │
│ NM_001278433.1 │ transcript          │      11 │
│ NM_001278433.1 │ transcript/8ecabff0 │      11 │
│ NM_001278433.1 │ transcript/92190059 │      10 │
└────────────────┴─────────────────────┴─────────┘

So, it looks to me as though you should upgrade to uta_20190926, in which NM_001278433.1 aligns to NC_000017.10 and NC_000017.11 without issues.
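The same check can be sketched in code: drop the deprecated methods (those containing `/`), then keep only the alignments whose exon count matches the current `transcript` self-alignment. The rows are copied from the query output above; the helper name is hypothetical.

```python
from typing import List, Tuple

Row = Tuple[str, str, int]  # (alt_ac, alt_aln_method, n_exons)

rows: List[Row] = [
    ("NC_000017.10", "blat", 11),
    ("NC_018928.2", "splign", 10),
    ("AC_000149.1", "splign", 10),
    ("NC_000017.10", "splign", 11),
    ("NC_000017.11", "splign", 11),
    ("NC_000017.10", "splign/04e3c837", 10),
    ("NM_001278433.1", "transcript", 11),
    ("NM_001278433.1", "transcript/8ecabff0", 11),
    ("NM_001278433.1", "transcript/92190059", 10),
]


def consistent_alignments(rows: List[Row], tx_ac: str) -> List[Tuple[str, str]]:
    """Return (alt_ac, method) pairs whose exon count matches the self-alignment."""
    current = [r for r in rows if "/" not in r[1]]  # skip deprecated methods
    self_counts = [n for alt_ac, method, n in current
                   if method == "transcript" and alt_ac == tx_ac]
    if len(self_counts) != 1:
        raise ValueError(f"expected one current self-alignment for {tx_ac}")
    expected = self_counts[0]
    return [(alt_ac, method) for alt_ac, method, n in current
            if method != "transcript" and n == expected]


print(consistent_alignments(rows, "NM_001278433.1"))
```

For these rows, the blat and splign alignments to NC_000017.10 and the splign alignment to NC_000017.11 pass the check.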

Please close if that answers your question.

reece avatar Sep 09 '20 04:09 reece

Reece:

Thank you very much for your time; that was helpful.

I don't see uta_20190926 as a tag on the Docker Hub page, so I wasn't sure whether it was advisable to use: https://hub.docker.com/r/biocommons/uta/tags

Is this version an "official" release that was built/validated to the same standards as the uta_20180821 version?

Also, if we did update to the 2019 uta, which versions of hgvs and seqrepo would you recommend moving up to?

We currently use:

  • uta: uta_20180821
  • seqrepo: 2018-08-21
  • hgvs: 1.3.0

Thanks again.

Matt

gostachowiak avatar Sep 09 '20 15:09 gostachowiak

uta_20190926 currently has an issue (#228) that prevents us from building Docker images. A change was made to materialize a very large view, and materializing it takes more than 12 hours (I killed it at that point). We'll need to unwind that before distributing Docker images.

You should be able to use any version of hgvs. The change log may help you figure out whether any of the changes since 1.3.0 are relevant to you.

Unfortunately, you'll have to wait on the uta fixes. No ETA yet.

reece avatar Sep 12 '20 16:09 reece