Don't align transcripts with different numbers of exons
Originally reported by Reece Hart (Bitbucket: reece, GitHub: reece) in biocommons/uta #195. Migrated by bitbucket-issue-migration on 2016-09-09 15:15:07.
UTA historically has aligned transcript and genomic exons even when the number of exons in each exon set differs. This practice masks real issues in underlying data and should be discontinued.
I have discovered an issue with transcript NM_001278433.1 (gene PRKAR1A), which I believe is an example of this issue. If my understanding is incorrect, please let me know.
Exon sets for the transcript:
SET search_path=uta_20180821;
SELECT * FROM exon_set WHERE tx_ac='NM_001278433.1';
267741 NM_001278433.1 AC_000149.1 1 splign 2014-02-11 01:22:19.920492
332948 NM_001278433.1 NC_000017.10 1 blat 2014-02-11 02:40:24.121284
267727 NM_001278433.1 NC_000017.10 1 splign 2014-02-11 01:22:19.920492
763376 NM_001278433.1 NC_000017.11 1 splign 2016-08-27 17:40:37.616249
267735 NM_001278433.1 NC_018928.2 1 splign 2014-02-11 01:22:19.920492
738588 NM_001278433.1 NM_001278433.1 1 transcript 2016-08-27 10:28:27.974572
88837 NM_001278433.1 NM_001278433.1 1 transcript/8ecabff0 2014-02-11 00:00:18.455632
344311 NM_001278433.1 NM_001278433.1 1 transcript/92190059 2015-08-25 22:44:41.311184
The GRCh37 splign chromosomal alignment has 10 exons:
SET search_path=uta_20180821;
SELECT * FROM exon WHERE exon_set_id='267727';
The "self" alignment has 11 exons:
SET search_path=uta_20180821;
SELECT * FROM exon WHERE exon_set_id='738588';
Comparing exon lengths shows that the discrepancy is in exon 1, so g-to-c calculations using hgvs give incorrect results for variants along the entire transcript.
My assumption was that "transcript" is the relevant self-alignment, and not "transcript/8ecabff0" or "transcript/92190059".
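To illustrate why a discrepancy in exon 1 would affect every downstream position, here is a minimal sketch. The exon lengths are made-up values chosen for illustration, not the real exons of NM_001278433.1, and `exon_offsets` is a hypothetical helper rather than anything from hgvs or UTA.

```python
from itertools import accumulate


def exon_offsets(exon_lengths):
    """Return the 0-based transcript offset at which each exon starts."""
    # Each exon starts where all preceding exons end, so the offsets are a
    # running sum of the preceding exon lengths.
    return [0] + list(accumulate(exon_lengths))[:-1]


# Hypothetical exon lengths (NOT the real data for NM_001278433.1).
# The two exon sets disagree only in the length of exon 1.
tx_exon_lengths = [50, 120, 90]
alt_exon_lengths = [80, 120, 90]

# Difference between the start offsets of corresponding exons.
shifts = [
    alt - tx
    for tx, alt in zip(exon_offsets(tx_exon_lengths), exon_offsets(alt_exon_lengths))
]
print(shifts)  # [0, 30, 30]
```

Every exon after the first inherits the 30-base difference, which is why a single bad exon near the 5' end can shift coordinates along the whole transcript.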
First, I'm impressed that you dove this far into UTA internals!
I don't know the story for this transcript specifically, and these data are 4-6 years old, perhaps from the time before NCBI released gff files. So, this might be hard to reproduce now from sources.
When alt_aln_method contains /, it means that the UTA loader encountered a case where the definition provided by NCBI changed over time. When this happens, UTA deprecates the existing definition by renaming the alignment method. (The hash after the / is a truncated MD5 made by serializing the start,end coordinates and CDS start,end.)
The presence of / nearly always means that the assembly and/or alignments are problematic. So, proceed with caution.
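The renaming scheme described above can be sketched roughly as follows. This is an illustration only: the serialization format (the `;` and `|` separators) and the function names are assumptions, not UTA's actual loader code, so the hashes it produces will not match those stored in UTA. The 8-character truncation matches the length of the suffixes seen in this thread, such as 8ecabff0.

```python
import hashlib


def exon_set_fingerprint(exons, cds_start, cds_end, length=8):
    """Hypothetical fingerprint: truncated MD5 of exon and CDS coordinates."""
    # Assumed serialization: "start,end" pairs joined by ";", followed by the
    # CDS bounds. UTA's real format may differ.
    exon_part = ";".join(f"{start},{end}" for start, end in exons)
    payload = f"{exon_part}|{cds_start},{cds_end}"
    return hashlib.md5(payload.encode("ascii")).hexdigest()[:length]


def deprecated_method_name(method, fingerprint):
    """Build a renamed alignment method such as 'transcript/<hash>'."""
    return f"{method}/{fingerprint}"


# Made-up coordinates for illustration, not real transcript data.
old_definition = [(0, 50), (50, 170), (170, 260)]
new_definition = [(0, 80), (80, 200), (200, 290)]

old_fp = exon_set_fingerprint(old_definition, cds_start=10, cds_end=250)
new_fp = exon_set_fingerprint(new_definition, cds_start=10, cds_end=280)
print(deprecated_method_name("transcript", old_fp))
```

Because the fingerprint depends only on the coordinates, reloading an unchanged definition yields the same name, while a changed definition yields a different one.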
In uta_20190926, I see this:
anonymous@uta/uta=> set search_path = uta_20190926 ;
anonymous@uta/uta=> select alt_ac, alt_aln_method, n_exons from tx_exon_set_summary_mv where tx_ac = 'NM_001278433.1' order by 2;
┌────────────────┬─────────────────────┬─────────┐
│ alt_ac │ alt_aln_method │ n_exons │
├────────────────┼─────────────────────┼─────────┤
│ NC_000017.10 │ blat │ 11 │
│ NC_018928.2 │ splign │ 10 │
│ AC_000149.1 │ splign │ 10 │
│ NC_000017.10 │ splign │ 11 │
│ NC_000017.11 │ splign │ 11 │
│ NC_000017.10 │ splign/04e3c837 │ 10 │
│ NM_001278433.1 │ transcript │ 11 │
│ NM_001278433.1 │ transcript/8ecabff0 │ 11 │
│ NM_001278433.1 │ transcript/92190059 │ 10 │
└────────────────┴─────────────────────┴─────────┘
So, it looks to me as though you should upgrade to uta_20190926, in which NM_001278433.1 aligns to NC_000017.10 and NC_000017.11 without issues.
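As a rough sketch of the check proposed in the issue title, the snippet below takes the rows from the query output above, drops deprecated alignment methods (those containing /), and compares each genomic alignment's exon count with the current "transcript" self-alignment. The function names are hypothetical and this is not how UTA itself performs the check.

```python
# (alt_ac, alt_aln_method, n_exons), transcribed from the uta_20190926 output.
ROWS = [
    ("NC_000017.10", "blat", 11),
    ("NC_018928.2", "splign", 10),
    ("AC_000149.1", "splign", 10),
    ("NC_000017.10", "splign", 11),
    ("NC_000017.11", "splign", 11),
    ("NC_000017.10", "splign/04e3c837", 10),
    ("NM_001278433.1", "transcript", 11),
    ("NM_001278433.1", "transcript/8ecabff0", 11),
    ("NM_001278433.1", "transcript/92190059", 10),
]


def is_deprecated(method):
    """A '/' in alt_aln_method marks a deprecated exon set."""
    return "/" in method


def split_by_exon_count(rows):
    """Split current genomic alignments by agreement with the self-alignment."""
    current = [row for row in rows if not is_deprecated(row[1])]
    tx_counts = [n for _, method, n in current if method == "transcript"]
    if len(tx_counts) != 1:
        raise ValueError("expected exactly one current 'transcript' exon set")
    expected = tx_counts[0]
    genomic = [row for row in current if row[1] != "transcript"]
    consistent = [row for row in genomic if row[2] == expected]
    mismatched = [row for row in genomic if row[2] != expected]
    return consistent, mismatched


consistent, mismatched = split_by_exon_count(ROWS)
print("consistent:", consistent)
print("mismatched:", mismatched)
```

For this transcript, the NC_000017.10 and NC_000017.11 alignments agree with the 11-exon self-alignment, while the NC_018928.2 and AC_000149.1 alignments have 10 exons and would be flagged.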
Please close if that answers your question.
Reece:
Thank you very much for your time; that was helpful.
I don't see uta_20190926 as a tag on the Docker Hub page, so I wasn't sure whether it is advisable to use: https://hub.docker.com/r/biocommons/uta/tags
Is this version an "official" release that was built/validated to the same standards as the uta_20180821 version?
Also, if we did update to the 2019 uta, which versions of hgvs and seqrepo would you recommend moving up to?
We currently use:
- uta: uta_20180821
- seqrepo: 2018-08-21
- hgvs: 1.3.0
Thanks again.
Matt
uta_20190926 currently has an issue (#228) that prevents us from building Docker images. A change was made to materialize a very large view, and materialization had been running for more than 12 hours when I killed it. We'll need to unwind that before distributing Docker images.
You should be able to use any version of hgvs. The change log may help you figure out whether any of the changes since 1.3.0 are relevant to you.
Unfortunately, you'll have to wait on the uta fixes. No ETA yet.