ensembldb Problem with ensDbFromGff using recent ensembl gff files

I'm having troubles with ensDbFromGff for some gff3 files downloaded from Ensembl ftp. For example, Danio rerio in Ensembl versions 94, 95, and 96 (but not Ensembl versions 93, 92, 91, 90, or 85), using the following code:

gff <- "../data/test_files/Danio_rerio.GRCz11.94.gff3.gz" DB <- ensDbFromGff(gff=gff) edb <- EnsDb(DB) edb transcripts(edb)

I get the corresponding warning message:

Warning messages: 1: In [<-.factor(*tmp*, idx, value = "transcript") : invalid factor level, NA generated

And zero transcripts show up using transcripts(edb)

GRanges object with 0 ranges and 6 metadata columns:

This also happens for Ailuropoda_melanoleuca (works for Ens version 85 but not 91), so I bet this is a widespread problem across organisms.

Apr 17 '19 20:04 stephenrong

Thanks for reporting @stephenrong. My quick guess is that there has been a change in the gff file format from Ensembl. Is there any reason you're building the EnsDb from the GFF instead of using the official ones from AnnotationHub?

Apr 18 '19 10:04 jorainer

Hi @jorainer I ran into the same problem . because 'Schizosaccharomyces pombe' does not exist on AnnotationHub. I had to download the GFF file for this species from ensembl. The version of my ensembldb package is 2.6.8, R 3.5.1 env. Attach the gff link: [(https://fungi.ensembl.org/Schizosaccharomyces_pombe/Info/Index)] Here is the demo of GFF file

##gff-version 3
##sequence-region   I 1 5579133
##sequence-region   II 1 4539804
##sequence-region   III 1 2452883
##sequence-region   MT 1 19431
#!genome-build PomBase ASM294v2
#!genome-version ASM294v2
#!genome-date 2009-05
#!genome-build-accession GCA_000002945.2
#!genebuild-last-updated 2015-11
I	PomBase	gene	1	5662	.	-	.	ID=gene:SPAC212.11;Name=tlh1;biotype=protein_coding;description=RecQ type DNA helicase [Source:PomBase%3BAcc:SPAC212.11];gene_id=SPAC212.11;logic_name=pombase
I	PomBase	mRNA	1	5662	.	-	.	ID=transcript:SPAC212.11.1;Parent=gene:SPAC212.11;Name=tlh1;biotype=protein_coding;transcript_id=SPAC212.11.1
I	PomBase	exon	1	5662	.	-	.	Parent=transcript:SPAC212.11.1;Name=SPAC212.11.1:exon:1;constitutive=1;ensembl_end_phase=1;ensembl_phase=0;exon_id=SPAC212.11.1:exon:1;rank=1
I	PomBase	CDS	1	5662	.	-	0	ID=CDS:SPAC212.11.1:pep;Parent=transcript:SPAC212.11.1;protein_id=SPAC212.11.1:pep
###
I	PomBase	chromosome	1	5579133	.	.	.	ID=chromosome:I;Alias=CU329670.1,chr1
###
I	PomBase	pseudogene	5726	6331	.	-	.	ID=gene:SPAC212.10;Name=SPAC212.10;biotype=pseudogene;gene_id=SPAC212.10;logic_name=pombase
I	PomBase	pseudogenic_transcript	5726	6331	.	-	.	ID=transcript:SPAC212.10.1;Parent=gene:SPAC212.10;Name=SPAC212.10.1;biotype=pseudogene;transcript_id=SPAC212.10.1
I	PomBase	exon	5726	6331	.	-	.	Parent=transcript:SPAC212.10.1;Name=SPAC212.10.1:pseudogenic_exon:1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=SPAC212.10.1:pseudogenic_exon:1;rank=1

Jun 11 '19 11:06 weir12

Thanks for reporting @weir12 . Note that EnsDb databases built from GFF will miss some annotations (because they are not provided in the GFF files) I'd rather suggest to build the object directly from Ensembl with their perl API. Since installing the Perl API and using it is not trivial I could create the EnsDb for you if you want. Just let me know what Ensembl/ensemblgenomes version you need and I'll create it.

Jun 11 '19 11:06 jorainer

@jorainer Many thanks, to be honest, I just want to get the transcript coordinates corresponding to the genomic coordinates of interest. So I found the function genomeToTranscript of the ensembldb package. Here is the NCBI link about the genome,annotation,transcript I used(https://www.ncbi.nlm.nih.gov/genome/?term=txid4896%5borgn%5d). If I am not mistaken, the genome version is ASM294v2 and the annotation file version is ASM294v2.43 in ensemble. If there is a more convenient and quick way to solve the problem of coordinate conversion，I am glad to accept that. thanks again

Jun 11 '19 11:06 weir12

I've created the EnsDb, you can download it from here.

To use it simply do:

> library(ensembldb)
> edb <- EnsDb("EnsDb.Spombe.v96.sqlite")
> genes(edb)
GRanges object with 7268 ranges and 8 metadata columns:
              seqnames      ranges strand |     gene_id   gene_name
                 <Rle>   <IRanges>  <Rle> | <character> <character>
  SPBC460.01c AB325691   1479-3197      - | SPBC460.01c SPBC460.01c
  SPBC460.02c AB325691   8856-9803      - | SPBC460.02c SPBC460.02c
          ...      ...         ...    ... .         ...         ...
     SPMTR.03      MTR 15593-15721      - |    SPMTR.03     mat3-Mm
     SPMTR.04      MTR 15856-16401      + |    SPMTR.04     mat3-Mc
                gene_biotype seq_coord_system
                 <character>      <character>
  SPBC460.01c protein_coding       chromosome
  SPBC460.02c protein_coding       chromosome
          ...            ...              ...
     SPMTR.03 protein_coding       chromosome
     SPMTR.04 protein_coding       chromosome
                                                                                                                   description
                                                                                                                   <character>
  SPBC460.01c                                                    amino-acid permease, unknown [Source:PomBase;Acc:SPBC460.01c]
  SPBC460.02c eukaryotic translation elongation factor, glutathione S-transferase (predicted) [Source:PomBase;Acc:SPBC460.02c]
          ...                                                                                                              ...
     SPMTR.03                                     silenced mating-type m-specific polypeptide Mi [Source:PomBase;Acc:SPMTR.03]
     SPMTR.04                                     silenced mating-type m-specific polypeptide Mc [Source:PomBase;Acc:SPMTR.04]
              gene_id_version      symbol            entrezid
                  <character> <character>              <list>
  SPBC460.01c     SPBC460.01c SPBC460.01c                  NA
  SPBC460.02c     SPBC460.02c SPBC460.02c                  NA
          ...             ...         ...                 ...
     SPMTR.03        SPMTR.03     mat3-Mm c(2539637, 3361261)
     SPMTR.04        SPMTR.04     mat3-Mc             2540048
  -------
  seqinfo: 6 sequences from ASM294v2 genome

You can then obviously also use the corrdinate mapping tools - and since it contains protein data you can also map to protein sequence-relative coordinates.

Jun 11 '19 13:06 jorainer

@jorainer Cheers!The problem was solved perfectly.I was really entangled in this problem for a whole day.Sincerely thank you！

Jun 11 '19 14:06 weir12

Hi Johannes, I face the exact same problem as reported in this thread, but for the organism 'Lottia gigantea'. This organisms does also not exists in the AnnotationHub. Any assistance would be appreciated! Thanks, Guido

http://metazoa.ensembl.org/Lottia_gigantea/Info/Index GFF: ftp://ftp.ensemblgenomes.org/pub/metazoa/release-44/gff3/lottia_gigantea/Lottia_gigantea.Lotgi1.44.gff3.gz

> ensDbFromGff(gff="Lottia_gigantea.Lotgi1.44.gff3.gz", path="./out", organism="Lottia_gigantea", genomeVersion="Lotgi1", version="44")
Importing GFF ... OK
Fixing IDs ... OK
Processing genes ... OK
Processing transcripts ... OK
Processing exons ... OK
  -------------
Proceeding to create the database.
Error in connection_connect(dbname, loadable.extensions, flags, vfs) : 
  Could not connect to database:
unable to open database file
In addition: Warning message:
In `[<-.factor`(`*tmp*`, idx, value = "transcript") :
  invalid factor level, NA generated
> 
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Thirty)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

> traceback()
10: stop(list(message = "Could not connect to database:\nunable to open database file", 
        call = connection_connect(dbname, loadable.extensions, flags, 
            vfs), cppstack = list(file = "", line = -1L, stack = c("/usr/lib64/R/library/RSQLite/libs/RSQLite.so(Rcpp::exception::exception(char const*, bool)+0xa7) [0x7eff9101fbc7]", 
        "/usr/lib64/R/library/RSQLite/libs/RSQLite.so(void Rcpp::stop<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&)+0x54) [0x7eff9102284d]", 
        "/usr/lib64/R/library/RSQLite/libs/RSQLite.so(DbConnection::DbConnection(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x95) [0x7eff910222d5]", 
        "/usr/lib64/R/library/RSQLite/libs/RSQLite.so(connection_connect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x42) [0x7eff910319f2]", 
        "/usr/lib64/R/library/RSQLite/libs/RSQLite.so(_RSQLite_connection_connect+0x35a) [0x7eff9102ae8a]", 
        "/usr/lib64/R/lib/libR.so(+0x139ed9) [0x7effa932ced9]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14a54c) [0x7effa933d54c]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x2fc) [0x7effa933cd7c]", 
        "/usr/lib64/R/lib/libR.so(+0x142333) [0x7effa9335333]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(R_execMethod+0x258) [0x7effa933faf8]", 
        "/usr/lib64/R/library/methods/libs/methods.so(+0x53b2) [0x7eff9709f3b2]", 
        "/usr/lib64/R/lib/libR.so(+0x18e27f) [0x7effa938127f]", "/usr/lib64/R/lib/libR.so(+0x13c80e) [0x7effa932f80e]", 
        "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(R_execMethod+0x258) [0x7effa933faf8]", 
        "/usr/lib64/R/library/methods/libs/methods.so(+0x53b2) [0x7eff9709f3b2]", 
        "/usr/lib64/R/lib/libR.so(+0x18e27f) [0x7effa938127f]", "/usr/lib64/R/lib/libR.so(+0x13c80e) [0x7effa932f80e]", 
        "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]", 
        "/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]", 
        "/usr/lib64/R/lib/libR.so(Rf_eval+0x2ba) [0x7effa933cd3a]", 
        "/usr/lib64/R/lib/libR.so(+0x14e3e7) [0x7effa93413e7]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x58d) [0x7effa933d00d]", 
        "/usr/lib64/R/lib/libR.so(Rf_ReplIteration+0x1f2) [0x7effa936e6d2]", 
        "/usr/lib64/R/lib/libR.so(+0x17ba60) [0x7effa936ea60]", "/usr/lib64/R/lib/libR.so(run_Rmainloop+0x50) [0x7effa936eb20]", 
        "/usr/lib64/R/bin/exec/R(main+0x1f) [0x55b69a5a30af]", "/lib64/libc.so.6(__libc_start_main+0xf3) [0x7effa9023f33]", 
        "/usr/lib64/R/bin/exec/R(_start+0x2e) [0x55b69a5a30ee]"))))
9: connection_connect(dbname, loadable.extensions, flags, vfs)
8: initialize(value, ...)
7: initialize(value, ...)
6: new("SQLiteConnection", ptr = connection_connect(dbname, loadable.extensions, 
       flags, vfs), dbname = dbname, flags = flags, vfs = vfs, loadable.extensions = loadable.extensions, 
       ref = new.env(parent = emptyenv()), bigint = bigint)
5: .local(drv, ...)
4: dbConnect(dbDriver("SQLite"), dbname = dbname)
3: dbConnect(dbDriver("SQLite"), dbname = dbname)
2: ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, 
       genomeVersion = genFromFile, version = ensFromFile, ...)
1: ensDbFromGff(gff = "Lottia_gigantea.Lotgi1.44.gff3.gz", path = "./out", 
       organism = "Lottia_gigantea", genomeVersion = "Lotgi1", version = "44")

Jul 11 '19 12:07 guidohooiveld

Hi @guidohooiveld , I'll create the EnsDb. I'll let you know when it is ready and from where you can get it.

Jul 11 '19 12:07 jorainer

The EnsDb (Ensembl 97/Ensemblgenomes 44) is available here

Jul 11 '19 13:07 jorainer

Wow, within 1 hr a working database... Thanks very much Johannes, much appreciated!

Jul 11 '19 13:07 guidohooiveld

Let me know if you need an updated version or other too.

Jul 11 '19 13:07 jorainer

Thanks for reporting @stephenrong. My quick guess is that there has been a change in the gff file format from Ensembl. Is there any reason you're building the EnsDb from the GFF instead of using the official ones from AnnotationHub?

I am struggling with a similar issue. But 'AnnotationHub' doesn't contain information for Pseudomonas aeruginosa

May 13 '22 13:05 Dataxplora

@Olakmephd, bacteria are a little more difficult. Ensembl does not provide a single database for each bacteria, but has databases for a collection of bacteria. I did not yet find a good way to extract the data for a single bacteria from such a database using ensembldb (and the Ensembl Perl API).

You could however download the GFF file for this bacteria e.g. from: ftp://ftp.ensemblgenomes.org/pub/release-53/bacteria/gff3/bacteria_98_collection/pseudomonas_aeruginosa_gca_003325605 (although I don't know if that's the actuall strain you are looking for) and then create the EnsDb using

> library(ensembldb)
> db <- ensDbFromGff("Pseudomonas_aeruginosa_gca_003325605.ASM332560v1.49.gff3.gz")
Importing GFF ... OK
Fixing IDs ... OK
Processing genes ... OK
Processing transcripts ... OK
Processing exons ... OK
  -------------
Proceeding to create the database.
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... OK
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o transcript_biotype ... OK
  o transcript_name ... Nope
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL
OK
Processing metadata ... OK
Generating index ... OK
  -------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
Warning messages:
1: In `[<-.factor`(`*tmp*`, idx, value = "transcript") :
  invalid factor level, NA generated
2: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile,  :
   I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
3: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile,  :
  I'm missing column(s): 'transcript_name'. The corresponding database columns will be empty!
4: In .getSeqlengthsFromMysqlFolder(organism = organism, ensembl = ensemblVersion,  :
  Can not get the sequence lengths from Ensembl or Ensemblgenomes. Seqinfo will lack the sequence lengths.
5: In tryGetSeqinfoFromEnsembl(organism, version, seqnames = chroms$seq_name) :
  Unable to retrieve sequence lengths from Ensembl.

You can ignore the warning messages. Also, the function fails to retrieve the correct chromosome lengths from Ensembl, so you'll not have that information in the database. You can then load this databases with:

> edb <- EnsDb(db)
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.0.1
|Creation time: Mon May 16 06:06:51 2022
|ensembl_version: 49
|ensembl_host: unknown
|Organism: Pseudomonas_aeruginosa_gca_003325605
|genome_build: ASM332560v1
|DBSCHEMAVERSION: 1.0
|source_file: Pseudomonas_aeruginosa_gca_003325605.ASM332560v1.49.gff3.gz
| No. of genes: 8770.
| No. of transcripts: 8770.
> organism(edb)
[1] "Pseudomonas aeruginosa gca 003325605"

May 16 '22 06:05 jorainer

@jorainer I am really sorry to ask you but I have been trying to make EnsDB for "Trichoderma reesei RUT-C30", however, I am getting the following warning messages. Could you please look into it and kindly assist me?

Thanks `> gfffile <- "Trichoderma_reesei_rut_c_30_gca_000513815.TrireRUTC30_1.56.gff3.gz"

DB <- ensDbFromGff(gff = gfffile) Importing GFF ... OK Fixing IDs ... OK Processing genes ... OK Processing transcripts ... OK Processing exons ... OK Proceeding to create the database. Processing genes ... Attribute availability: o gene_id ... OK o gene_name ... OK o entrezid ... Nope o gene_biotype ... OK OK Processing transcripts ... Attribute availability: o transcript_id ... OK o gene_id ... OK o transcript_biotype ... OK o transcript_name ... Nope OK Processing exons ... OK Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL OK Processing metadata ... OK Generating index ... OK Verifying validity of the information in the database: Checking transcripts ... OK Checking exons ... OK Warning messages: 1: In [<-.factor(*tmp*, idx, value = "transcript") : invalid factor level, NA generated 2: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, : I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty! 3: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, : I'm missing column(s): 'transcript_name'. The corresponding database columns will be empty! 4: In .getSeqlengthsFromMysqlFolder(organism = organism, ensembl = ensemblVersion, : Can not get the sequence lengths from Ensembl or Ensemblgenomes. Seqinfo will lack the sequence lengths. 5: In tryGetSeqinfoFromEnsembl(organism, version, seqnames = chroms$seq_name) : Unable to retrieve sequence lengths from Ensembl. `

Apr 25 '23 20:04 IrshadUlHaq1

@IrshadUlHaq1 , the warnings just tell you that some of the content is not available in the EnsDb you created, because it is not stored/available in the GFF file. The EnsDb should however still be usable.

Apr 26 '23 09:04 jorainer

@jorainer, thank you for your input. The generated 'edb' is usable, however, my end goal is to extract 'entrez' ID for a subset of the 'gene_id'. I need 'entrez' to perform pathway analyses. While I am aware that the 'gff' file does not contain 'entrez' info, I wonder if you have other suggestion for me? I tried to generate the 'edb' using Ensembl Perl API but I ran into problems on my Arch linux.

On another note, could the 'gene_id' be converted to 'entrez' id's by a different package such as 'ClusterProfiler'? Also, the 'gene_id' in the 'edb' are 'stable IDs' and I having a hard time to understand the differences here.

I appreciate your help and time. Again, thank you.

May 01 '23 15:05 IrshadUlHaq1

Unfortunately the scripts in ensembldb to create an EnsDb using the Ensembl Perl API don't work properly for bacteria, mostly because Ensembl does not provide a database for a single species/strain (they are bundled in "collections").

I guess it should be possible to convert the IDs in the EnsDb to Entrez IDs - maybe using biomaRt? Or any other package or resource that provides annotations for your species.

The gene_id in EnsDb is usually the Ensembl gene ID (ENSG...), but it depends what identifiers the GFF provides. It will be the gene identifier provided within the GFF file.

May 03 '23 06:05 jorainer