ensembldb icon indicating copy to clipboard operation
ensembldb copied to clipboard

missing data in human EnsDb v112?

Open guidohooiveld opened this issue 1 year ago • 14 comments

Hi Johannes, When working with the human EnsDb version 112 I obtained an unexpected result. Specifically, it seems a substantial number of genes (and transcripts) are not included in the current EnsDb. This seems not to be the case with EnsDb version 111 (the previous one). Moreover, when manually checking some of the 'missing' genes I found that they are still listed/included on the v112 Ensembl website. See below for details.

Question: did I oversee something and is this expected behavior, or is indeed something wrong?

Also note: | No. of genes differs a lot between EnsDb v111 and v112 (72035 and 64102). Idem for | No. of transcripts (278721 and 215647). Also note 2: EnsDb v112 has tag |genome_build: GRCh37, but is should be GRCh38 (like v111), right?

Thanks for your feedback! G

Background: For a recent sequencing experiment (human samples) I mapped the reads to the latest GENCODE release (= release 46) using the tool salmon. Next I imported all counts in R/Bioconductor using tximeta. So far, so good!

To add annotations I use the EnsDb. GENCODE release 46 corresponds to Ensembl-release 112. I thus downloaded human EnsDb v112 from the AnnotationHub. See: https://github.com/jorainer/ensembldb/issues/154#issuecomment-2165908661

Yet, after annotating all genes I noticed I 'lost' quite a lot of genes. This is unexpected, because this did not happen before. Moreover, this is not the case when using EnsDb v111 for annotating.

Code to illustrate this:

> ## define not-in help function
> `%!in%` = Negate(`%in%`)
> 
> ##load required libraries
> library(tximeta)
> library(EnsDb.Hsapiens.v112)
> 
> ## check EnsDb; No. of genes: 64102.
> EnsDb.Hsapiens.v112
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Fri May 17 11:25:50 2024
|ensembl_version: 112
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh37
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 64102.
| No. of transcripts: 215647.
|Protein data available.
> 
> ## check content tximeta object
> ## Number of genes is 63086
> gse.lengthscaled
class: RangedSummarizedExperiment 
dim: 63086 98 
metadata(7): tximetaInfo quantInfo ... txdbInfo assignRanges
assays(103): counts abundance ... infRep99 infRep100
rownames(63086): ENSG00000000003.16 ENSG00000000005.6 ...
  ENSG00000293599.1 ENSG00000293600.2
rowData names(6): gene_id tx_ids ... max_prop iso_prop
colnames(98): P02D1 P02D2 ... P51D1 P51D2
colData names(10): names SubjectID ... SampleName SalmonFile
> 
> ## retrieve all keys from EnsDb112, and check numbers
> ## everything makes sense!
> keys.EnsDb112 <- keys(EnsDb.Hsapiens.v112)
> length(!duplicated(keys.EnsDb112))
[1] 64102
> sum(!duplicated(keys.EnsDb112))
[1] 64102
> 
> ## do the same for tximeta object
> input.genes <- mcols(gse.lengthscaled)$gene_id
> length(!duplicated(input.genes))
[1] 63086
> sum(!duplicated(input.genes))
[1] 63086
> head(input.genes)
[1] "ENSG00000000003.16" "ENSG00000000005.6"  "ENSG00000000419.14"
[4] "ENSG00000000457.14" "ENSG00000000460.17" "ENSG00000000938.13"
> 
> ## remove version ids; nothing strange happened
> input.genes <- sub("\\..*", "", input.genes)
> length(!duplicated(input.genes))
[1] 63086
> sum(!duplicated(input.genes))
[1] 63086
> head(input.genes)
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"
> 
> ## how many input genes are in EnsDb112?
> ## only 52083? That is strange!
> present.in.db.112 <- input.genes %in% keys.EnsDb112 
> length(present.in.db.112)
[1] 63086
> sum(present.in.db.112)
[1] 52083
> 
> ## number of not annotated genes (11003!):
> length(present.in.db.112) - sum(present.in.db.112)
[1] 11003
> 
> 
> 
> ## list 12 genes that could not be annotated:
> NOT.present.in.db.112 <- input.genes %!in% keys.EnsDb112 
> length(NOT.present.in.db.112)
[1] 63086
> sum(NOT.present.in.db.112)
[1] 11003
> 
> head( input.genes[NOT.present.in.db.112] )
[1] "ENSG00000273497" "ENSG00000273499" "ENSG00000273500" "ENSG00000273507"
[5] "ENSG00000273509" "ENSG00000273512"
> tail( input.genes[NOT.present.in.db.112] )
[1] "ENSG00000293594" "ENSG00000293595" "ENSG00000293596" "ENSG00000293597"
[5] "ENSG00000293599" "ENSG00000293600"
> 
>

Yet, these not-annotated genes are listed on the ensembl website...!? (despite likely being pseudo-genes etc).

ENSG00000273497 link ENSG00000273512 link

ENSG00000293594 link ENSG00000293600 link

> ## repeat, but with previous EnsDb: version 111
> library(EnsDb.Hsapiens.v111)
> EnsDb.Hsapiens.v111
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Tue Jan 16 10:37:47 2024
|ensembl_version: 111
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 72035.
| No. of transcripts: 278721.
|Protein data available.
> 
> 
> keys.EnsDb111 <- keys(EnsDb.Hsapiens.v111)
> length(!duplicated(keys.EnsDb111))
[1] 72035
> sum(!duplicated(keys.EnsDb111))
[1] 72035
> 
> ## how many input genes are in EnsDb111?
> ## Almost all!
> present.in.db.111 <- input.genes %in% keys.EnsDb111 
> length(present.in.db.111)
[1] 63086
> sum(present.in.db.111)
[1] 63060
> 
> ## number of not annotated genes (only 26!)
> length(present.in.db.111) - sum(present.in.db.111)
[1] 26
> 
>

guidohooiveld avatar Jun 14 '24 12:06 guidohooiveld

I found a terrorible mistake... genome_build: GRCh37 for v112 and genome_build: GRCh38 for v111.

I understand that liftover can take care of it. But I still prefer to use GRCh38 version...

Do you have any idea how to change the annotation database to GRCh38 version?

Update: Sorry for the redundant comment. I think I found how to make it. AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens", "GRCh38"))

> AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens","GRCh38"))
AnnotationHub with 26 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
#   tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
  ...        ...                               
  AH104864 | Ensembl 107 EnsDb for Homo sapiens
  AH109336 | Ensembl 108 EnsDb for Homo sapiens
  AH109606 | Ensembl 109 EnsDb for Homo sapiens
  AH113665 | Ensembl 110 EnsDb for Homo sapiens
  AH116291 | Ensembl 111 EnsDb for Homo sapiens
> AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens"))
AnnotationHub with 27 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
#   tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
  ...        ...                               
  AH109336 | Ensembl 108 EnsDb for Homo sapiens
  AH109606 | Ensembl 109 EnsDb for Homo sapiens
  AH113665 | Ensembl 110 EnsDb for Homo sapiens
  AH116291 | Ensembl 111 EnsDb for Homo sapiens
  AH116860 | Ensembl 112 EnsDb for Homo sapiens

Currently, EnsemblDB v112 is not available for GRCh38.

Maybe this will affect the output?

zyh4482 avatar Jun 17 '24 01:06 zyh4482

As a result, using the latest EnsemblDB v112, I could map fewer genes.

# 66214 found
# 65692/77379 OK

But using EnsemblDB v111, I could get most of my genes.

77373 found
77306/77379 OK

Is there anyway to fix it?

zyh4482 avatar Jun 17 '24 03:06 zyh4482

This is a bit surprising. I will dig a bit into this. I do not specifically set any Genome build version (AFAIK) but simply querying data from Ensembl using the perl API and putting that into SQLite databases (that are shared through AnnotationHub).

jorainer avatar Jun 17 '24 05:06 jorainer

Uh, yes, Ensembl release 112 has two core databases, one for GRCH37 and one for GRCH38. My automatic script picked up the first, which happened to be the GRCH37 :( . I'll create the correct version and fix/update the file in AnnotationHub.

jorainer avatar Jun 17 '24 07:06 jorainer

Thanks for having a look at this discrepancy, and finding and fixing the source of it so quickly. :)

Since processing of files on the AnnotationHub apparently may take some time; would you mind making the fixed database temporarily available through e.g. Dropbox as well? I/we can then fire-up our pipelines right away. :)

guidohooiveld avatar Jun 17 '24 07:06 guidohooiveld

Sure - I'm just at a conference at present and bandwidth is limited ... so all takes a bit longer. but I'll post a link here once I'm done (seems only the human annotations were affected, so I'll replace that)

jorainer avatar Jun 17 '24 07:06 jorainer

Thank you so much!

zyh4482 avatar Jun 17 '24 07:06 zyh4482

thanks for spotting this issue! I was not aware (and was not expecting) that Ensembl creates now two flavors of annotations, one for GRCH37 and one for GRCH38!

jorainer avatar Jun 17 '24 07:06 jorainer

so, the EnsDb for GrCh38 is available here. I'm also updating/replacing AnnotationHub, but we have to figure out how to best do that.

jorainer avatar Jun 18 '24 00:06 jorainer

Thanks for the updated file. I have downloaded it, and based on my tests it looks fine now. That is, all my input genes (63086) are being annotated now (in the previous, GRCH37-based version 11003 genes were missing). I will do some additional tests (with transcripts) and report back if needed.

Updated database:

> EnsDb.Hsapiens.v112
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Mon Jun 17 16:14:46 2024
|ensembl_version: 112
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 71935.
| No. of transcripts: 279860.
|Protein data available.
>

guidohooiveld avatar Jun 18 '24 06:06 guidohooiveld

Also obtained expected results on the level of transcripts, so as far as I am concerned the updated database is correct.

@zyh4482 : what about the outcomes of your analyses?

guidohooiveld avatar Jun 18 '24 14:06 guidohooiveld

For my dataset, I got:

Fetching CDS for 77379 proteins ... 77379 found
Checking CDS and protein sequence lengths ... 77312/77379 OK

At this stage, everything looks perfect.

Previously, I used v111 for my downstream analysis. I found some discrepancies regarding to unmapped genomic coordinates. I'll update my results later.

zyh4482 avatar Jun 19 '24 10:06 zyh4482

I finished my analysis. This database is perfect to use. Thanks again! @jorainer

zyh4482 avatar Jun 20 '24 02:06 zyh4482

Just to update: AH (both the 3.19 release and devel branches) should now also provide the database with the correct genome release.

jorainer avatar Jul 08 '24 05:07 jorainer