
uta_20150827 is missing ENSP accessions and sequences

Open reece opened this issue 9 years ago • 2 comments

Originally reported by Reece Hart (Bitbucket: reece, GitHub: reece) in biocommons/uta #194. Migrated by bitbucket-issue-migration on 2016-09-09 15:15:07.


uta_20150827 does not include ENSP sequences or seqinfo. One consequence is that c_to_p transformations in hgvs return MD5-based accessions for the resulting protein variants.

Resolving this issue means updating uta with ENSP sequences and accessions (from release-79).

FWIW, this occurs because Ensembl sequence accessions, as provided via fasta files on their web site, were discovered to be non-unique. That is, a single accession may be associated with more than one sequence. Roughly 10,000 ambiguous ENSPs exist between e-71 and e-81.

(It's likely that these ambiguities are distinguished by stable_id versions internally, but these distinctions are not exposed in the fasta files.)
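For illustration, here is a minimal sketch of how such ambiguities could be detected. It is not the code used for uta. It assumes each release's peptide fasta has already been read into a string, and that the accession is the first whitespace-delimited token of each header line. The names `parse_fasta` and `find_ambiguous` are hypothetical, and the records in the usage note are toy data, not real Ensembl sequences.

```python
import hashlib
from collections import defaultdict


def parse_fasta(text):
    """Yield (accession, sequence) pairs from FASTA-formatted text.

    Assumption: the accession is the first whitespace-delimited token
    after '>' on each header line.
    """
    acc, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if acc is not None:
                yield acc, "".join(chunks)
            tokens = line[1:].split()
            acc = tokens[0] if tokens else ""
            chunks = []
        elif acc is not None:
            chunks.append(line)
    if acc is not None:
        yield acc, "".join(chunks)


def find_ambiguous(releases):
    """Return {accession: set of MD5 hex digests} for every accession that
    maps to more than one distinct sequence across the given releases.

    releases: dict mapping a release label (e.g. "e-71") to FASTA text.
    """
    digests = defaultdict(set)
    for text in releases.values():
        for acc, seq in parse_fasta(text):
            md5 = hashlib.md5(seq.upper().encode("ascii")).hexdigest()
            digests[acc].add(md5)
    return {acc: d for acc, d in digests.items() if len(d) > 1}
```

Called with something like `find_ambiguous({"e-71": fasta_71, "e-81": fasta_81})`, this reports every accession whose sequence differs between (or within) releases. Comparing digests rather than raw sequences keeps memory use modest when scanning many releases.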

reece • Sep 16 '15 17:09

Original comment on Bitbucket by Reece Hart (Bitbucket: reece, GitHub: reece):


ftp.ensembl.org doesn't provide GRCh37 fasta downloads. That means the only source for sequences is the API. Fetching now.

reece • Sep 16 '15 23:09

Original comment on Bitbucket by Reece Hart (Bitbucket: reece, GitHub: reece):


@PeteCauseyFreeman Consider watching this issue.

reece • Sep 16 '15 17:09