annotatr
annotatr copied to clipboard
Mismatch between transcripts and genes
Hi, I'm using annotatr and I've noticed in my annotation a transcript that is associated to the wrong gene.
It is ENST00000580062.5, which is listed (in the annotation) as a transcript for SNRPN (entrez/gene id: 6638). However, while I was performing joins between my annotation and ensembl data I noticed that this transcript is from the gene SNURF (entrez/gene id: 8926), as shown on ensembl (https://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000273173;r=15:24954987-24977850).
I'm not sure if the annotation's source is wrong, but I've checked ensembl's archive and it this transcript has always been assigned to SNURF, so at least I don't think an outdated package on my side is causing this.
My version of annotatr is 1.26.0
To reproduce this:
library(annotatr)
library(AnnotationHub)
# BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene", lib = .libPaths()[1]) # I've run this
testAnnots = "hg38_genes_cds"
testBuiltAnnotation = annotatr::build_annotations(genome = "hg38", annotations = testAnnots)
testAnnotation = annotatr::annotate_regions(
regions = GenomicRanges::makeGRangesFromDataFrame( # a test region that overlaps the transcript of interest
data.frame(CHROM = "chr15", POS = 24955049, end = 24955065),
seqnames.field = "CHROM",
start.field = "POS",
end.field = "end",
keep.extra.columns = T
),
annotations = testBuiltAnnotation,
ignore.strand = T,
quiet = F
)
testDf = as.data.frame(testAnnotation)
print(testDf[testDf$annot.tx_id == "ENST00000580062.5", ])
Which prints:
seqnames start end width strand annot.seqnames annot.start annot.end annot.width annot.strand annot.id
2 chr15 24955049 24955065 17 * chr15 24955049 24955062 14 + CDS:618534
annot.tx_id annot.gene_id annot.symbol annot.type
2 ENST00000580062.5 6638 SNRPN hg38_genes_cds
Thanks for bringing this up. It is a bit odd; when I reach into the TxDb object for hg38 and extract the transcript with Ensembl transcript ID ENST00000580062.5 I get:
> tx_gr[tx_gr$TXNAME == 'ENST00000580062.5']
GRanges object with 1 range and 3 metadata columns:
seqnames ranges strand | TXID GENEID TXNAME
<Rle> <IRanges> <Rle> | <integer> <CharacterList> <character>
[1] chr15 24955004-24977850 + | 174731 6638 ENST00000580062.5
Suggesting that the annotation in the TxDb package is the source of what you're seeing. Unfortunately, I'm not sure what to do about this...