annotatr icon indicating copy to clipboard operation
annotatr copied to clipboard

Mismatch between transcripts and genes

Open javiercguard opened this issue 1 year ago • 1 comments

Hi, I'm using annotatr and I've noticed in my annotation a transcript that is associated to the wrong gene.

It is ENST00000580062.5, which is listed (in the annotation) as a transcript for SNRPN (entrez/gene id: 6638). However, while I was performing joins between my annotation and ensembl data I noticed that this transcript is from the gene SNURF (entrez/gene id: 8926), as shown on ensembl (https://www.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000273173;r=15:24954987-24977850).

I'm not sure if the annotation's source is wrong, but I've checked ensembl's archive and it this transcript has always been assigned to SNURF, so at least I don't think an outdated package on my side is causing this.

My version of annotatr is 1.26.0

To reproduce this:

library(annotatr)
library(AnnotationHub)
# BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene", lib = .libPaths()[1]) # I've run this
testAnnots = "hg38_genes_cds"
testBuiltAnnotation = annotatr::build_annotations(genome = "hg38", annotations = testAnnots)
testAnnotation = annotatr::annotate_regions(
    regions = GenomicRanges::makeGRangesFromDataFrame( # a test region that overlaps the transcript of interest
      data.frame(CHROM = "chr15", POS = 24955049, end = 24955065),
      seqnames.field = "CHROM",
      start.field = "POS",
      end.field = "end",
      keep.extra.columns = T
    ),
    annotations = testBuiltAnnotation,
    ignore.strand = T,
    quiet = F
  )
testDf = as.data.frame(testAnnotation)
print(testDf[testDf$annot.tx_id == "ENST00000580062.5", ])

Which prints:

  seqnames    start      end width strand annot.seqnames annot.start annot.end annot.width annot.strand   annot.id
2    chr15 24955049 24955065    17      *          chr15    24955049  24955062          14            + CDS:618534
        annot.tx_id annot.gene_id annot.symbol     annot.type
2 ENST00000580062.5          6638        SNRPN hg38_genes_cds

javiercguard avatar Mar 06 '24 17:03 javiercguard

Thanks for bringing this up. It is a bit odd; when I reach into the TxDb object for hg38 and extract the transcript with Ensembl transcript ID ENST00000580062.5 I get:

> tx_gr[tx_gr$TXNAME == 'ENST00000580062.5']
GRanges object with 1 range and 3 metadata columns:
      seqnames            ranges strand |      TXID          GENEID            TXNAME
         <Rle>         <IRanges>  <Rle> | <integer> <CharacterList>       <character>
  [1]    chr15 24955004-24977850      + |    174731            6638 ENST00000580062.5

Suggesting that the annotation in the TxDb package is the source of what you're seeing. Unfortunately, I'm not sure what to do about this...

rcavalcante avatar Jun 04 '24 15:06 rcavalcante