GRCh38 splign alignments available?

Open deannachurch opened this issue 5 years ago • 6 comments

Hi- I'm doing some variant mapping. Starting with this:

import hgvs.parser
hp = hgvs.parser.Parser()
v = hp.parse_hgvs_variant("NM_007194.4(CHEK2):c.1611T>A")
print(v)
print(v.posedit.pos.start)
print(v.posedit.edit.ref)

I get: NM_007194.4(CHEK2):c.1611T>A 1611 T

All is good. I can get the genomic location on GRCh37:

import hgvs.assemblymapper
import hgvs.dataproviders.uta
hdp = hgvs.dataproviders.uta.connect()
am37 = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name='GRCh37', alt_aln_method='splign', replace_reference=True)
var_g37 = am37.c_to_g(v)
print(var_g37)

NC_000022.10:g.29083906A>T

But when I try to get the location on GRCh38:

am38 = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name='GRCh38', alt_aln_method='splign', replace_reference=True)
var_g38=am38.c_to_g(v)
print(var_g38)

I get: HGVSDataNotAvailableError: No alignments for NM_007194.4 in GRCh38 using splign

This is surprising, as NM_007194.4 is still the current reference version and is in fact the MANE transcript. Since I'm trying to go back and forth between some RefSeq and Ensembl transcripts based on MANE, GRCh38 is preferable, though I will try with GRCh37 for now.

thanks!

deannachurch avatar Jan 02 '20 00:01 deannachurch

UTA needs an update. The most recent version is uta_20180821. Although NM_007194.4 is in UTA, the GRCh38 alignment for it is not. This typically means that it wasn't in the gff3 files at the time of the snapshot, or that it failed alignment criteria and was rejected. The criteria were selected with input from Terence M. in an effort to match alignment filtering at NCBI. NM_007194.4 became current on June 13, 2018, so perhaps the alignments didn't exist yet in August 2018.

I am looking for funding to automate the construction of UTA so that it doesn't fall so far behind.

reece avatar Jan 02 '20 14:01 reece

Additional comments:

  1. UTA does contain the alignment of NM_007194.3 to GRCh38. So, if you know that the exon structures and alignments are consistent in this exon, you can probably get away with using .3.

  2. You can see available transcripts and alignments like this:

$ export PGPASSWORD=uta_public
$ psql -h uta.biocommons.org -d uta -U uta_public
uta_public@uta/uta=> set search_path = uta_20180821;
uta_public@uta/uta=> select tx_ac,alt_ac,alt_aln_method from exon_set where tx_ac ~'^NM_007194' order by 1,2;
┌─────────────┬──────────────┬────────────────┐
│    tx_ac    │    alt_ac    │ alt_aln_method │
├─────────────┼──────────────┼────────────────┤
│ NM_007194.3 │ AC_000154.1  │ splign         │
│ NM_007194.3 │ NC_000022.10 │ splign         │
│ NM_007194.3 │ NC_000022.10 │ blat           │
│ NM_007194.3 │ NC_000022.11 │ splign         │
│ NM_007194.3 │ NC_018933.2  │ splign         │
│ NM_007194.3 │ NG_008150.1  │ splign         │
│ NM_007194.3 │ NM_007194.3  │ transcript     │
│ NM_007194.4 │ NC_000022.10 │ splign         │
│ NM_007194.4 │ NM_007194.4  │ transcript     │
└─────────────┴──────────────┴────────────────┘
(9 rows)
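
The same check can be done on the query result in Python. This is a minimal sketch, not part of hgvs or UTA: the rows are copied by hand from the output above, and `versions_with_alignment` is a hypothetical helper name. NC_000022.10 and NC_000022.11 are the chromosome 22 accessions in GRCh37 and GRCh38 respectively.

```python
# Rows copied from the exon_set query output above:
# (tx_ac, alt_ac, alt_aln_method)
ROWS = [
    ("NM_007194.3", "AC_000154.1", "splign"),
    ("NM_007194.3", "NC_000022.10", "splign"),
    ("NM_007194.3", "NC_000022.10", "blat"),
    ("NM_007194.3", "NC_000022.11", "splign"),
    ("NM_007194.3", "NC_018933.2", "splign"),
    ("NM_007194.3", "NG_008150.1", "splign"),
    ("NM_007194.3", "NM_007194.3", "transcript"),
    ("NM_007194.4", "NC_000022.10", "splign"),
    ("NM_007194.4", "NM_007194.4", "transcript"),
]


def versions_with_alignment(rows, tx_base, alt_ac, method="splign"):
    """Return the versions of tx_base (e.g. 'NM_007194') that have an
    alignment to alt_ac with the given method, sorted by version number.
    Hypothetical helper for illustration only."""
    hits = {
        tx_ac
        for tx_ac, row_alt_ac, row_method in rows
        if tx_ac.rsplit(".", 1)[0] == tx_base
        and row_alt_ac == alt_ac
        and row_method == method
    }
    return sorted(hits, key=lambda ac: int(ac.rsplit(".", 1)[1]))


# GRCh37 chromosome 22: both .3 and .4 are aligned.
print(versions_with_alignment(ROWS, "NM_007194", "NC_000022.10"))
# GRCh38 chromosome 22: only .3 is aligned, which is why c_to_g fails for .4.
print(versions_with_alignment(ROWS, "NM_007194", "NC_000022.11"))
```

Against a live database, the rows would come from the SQL query itself rather than a hard-coded list.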

reece avatar Jan 02 '20 14:01 reece

Hi Reece, Thanks for the update. It turns out my bigger problem is that I need to project alignments onto NM_001257387.2, which does not seem to be aligned to GRCh37; only .1 is. These two seem a bit different, at least based on length (.1 is 1976 bases and .2 is 1958 bases). Is there another source I can get the UTA updated from? Is it difficult to install UTA and then add information to it? Thanks for your help. I appreciate this is unfunded but it is super valuable. If there is anything I can do to help (letters, etc.) please let me know.

best, -deanna

deannachurch avatar Jan 02 '20 15:01 deannachurch

Updating UTA is a pain right now, which is why it has languished (much to my chagrin). I wouldn't wish that process on anyone. (However, instructions do exist if you're feeling intrepid. They refer to hosts within Invitae, but the process would be the same for your own installation.)

As for the offer of help, thanks! I'll follow up by email.

reece avatar Jan 02 '20 16:01 reece

Thanks Reece- appreciate the rapid response on the update.

deannachurch avatar Jan 02 '20 18:01 deannachurch

Note to self for future UTA update: I had hoped to use the gff3 files as-is for UTA cigar strings. This won't be possible because the gff3 files don't denote mismatches, which hgvs uses to correct for reference sequence differences.

Example: GCF_000001405.28_knownrefseq_alignments.gff3 (mirrored on 2019-09-05) contains:

NC_000010.11	RefSeq	cDNA_match	87863438	87864548	1104.71	+	.	ID=d73f8942-0138-46b9-8e95-56e7ebc1c240;Target=NM_000314.6 1 1110 +;gap_count=1;identity=0.99977;idty=0.9982;num_ident=8700;num_mismatch=1;pct_coverage=100;pct_identity_gap=99.977;pct_identity_ungap=99.9885;Gap=M666 D1 M444

UTA contains the cigar string M666 I1 39=1X404=. The I/D swap is because UTA is transcript-centric. The more interesting difference is the 39=1X404= / M444 difference. The length is the same, but the uta alignment correctly picked up that there is a mismatch. Note that NCBI's gff3 shows num_mismatch=1.

The upshot is that UTA will need to continue aligning regions in order to pick up mismatches.
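
The arithmetic behind this comparison can be checked with a short sketch. It is illustrative only and is not UTA or hgvs code; the helper names `parse_gap`, `parse_cigar`, and `swap_indels` are hypothetical. It assumes the GFF3 convention that in a `Gap` attribute `M` consumes both sequences and `D` consumes only the reference (here, the genome).

```python
import re


def parse_gap(gap):
    """Parse a GFF3 Gap attribute such as 'M666 D1 M444' into (op, length) pairs."""
    return [(tok[0], int(tok[1:])) for tok in gap.split()]


def parse_cigar(cigar):
    """Parse a count-then-op CIGAR such as '39=1X404=' into (op, length) pairs."""
    return [(op, int(n)) for n, op in re.findall(r"(\d+)([=XMID])", cigar)]


def swap_indels(ops):
    """Swap I and D to move between genome-centric and transcript-centric views."""
    swap = {"I": "D", "D": "I"}
    return [(swap.get(op, op), n) for op, n in ops]


gff3_ops = parse_gap("M666 D1 M444")

# Genome span implied by the Gap attribute: M and D consume genome bases.
genome_len = sum(n for op, n in gff3_ops if op in "MD")
# Transcript span: M and I consume transcript bases.
tx_len = sum(n for op, n in gff3_ops if op in "MI")
print(genome_len, tx_len)  # 1111 1110

# Transcript-centric view of the same alignment: the D becomes an I.
print(swap_indels(gff3_ops))

# The final segment as stored in UTA distinguishes matches (=) from mismatches (X).
uta_tail = parse_cigar("39=1X404=")
print(sum(n for _, n in uta_tail))                # 444, same length as M444
print(sum(n for op, n in uta_tail if op == "X"))  # 1, agrees with num_mismatch=1
```

The computed spans agree with the gff3 record: 87863438 to 87864548 is 1111 genomic bases, and the target runs from 1 to 1110. The mismatch position (transcript offset 39 into the last segment) is exactly the information that `M444` cannot express.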

reece avatar Feb 10 '20 06:02 reece

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Nov 30 '23 01:11 github-actions[bot]

This issue was closed because it has been stalled for 7 days with no activity.

github-actions[bot] avatar Dec 07 '23 01:12 github-actions[bot]
