Problem with ensDbFromGff using recent ensembl gff files
I'm having troubles with ensDbFromGff for some gff3 files downloaded from Ensembl ftp. For example, Danio rerio in Ensembl versions 94, 95, and 96 (but not Ensembl versions 93, 92, 91, 90, or 85), using the following code:
gff <- "../data/test_files/Danio_rerio.GRCz11.94.gff3.gz" DB <- ensDbFromGff(gff=gff) edb <- EnsDb(DB) edb transcripts(edb)
I get the corresponding warning message:
Warning messages: 1: In
[<-.factor(*tmp*, idx, value = "transcript") : invalid factor level, NA generated
And zero transcripts show up using transcripts(edb)
GRanges object with 0 ranges and 6 metadata columns:
This also happens for Ailuropoda_melanoleuca (works for Ens version 85 but not 91), so I bet this is a widespread problem across organisms.
Thanks for reporting @stephenrong. My quick guess is that there has been a change in the gff file format from Ensembl. Is there any reason you're building the EnsDb from the GFF instead of using the official ones from AnnotationHub?
Hi
@jorainer
I ran into the same problem .
because 'Schizosaccharomyces pombe' does not exist on AnnotationHub. I had to download the GFF file for this species from ensembl.
The version of my ensembldb package is 2.6.8, R 3.5.1 env.
Attach the gff link: [(https://fungi.ensembl.org/Schizosaccharomyces_pombe/Info/Index)]
Here is the demo of GFF file
##gff-version 3
##sequence-region I 1 5579133
##sequence-region II 1 4539804
##sequence-region III 1 2452883
##sequence-region MT 1 19431
#!genome-build PomBase ASM294v2
#!genome-version ASM294v2
#!genome-date 2009-05
#!genome-build-accession GCA_000002945.2
#!genebuild-last-updated 2015-11
I PomBase gene 1 5662 . - . ID=gene:SPAC212.11;Name=tlh1;biotype=protein_coding;description=RecQ type DNA helicase [Source:PomBase%3BAcc:SPAC212.11];gene_id=SPAC212.11;logic_name=pombase
I PomBase mRNA 1 5662 . - . ID=transcript:SPAC212.11.1;Parent=gene:SPAC212.11;Name=tlh1;biotype=protein_coding;transcript_id=SPAC212.11.1
I PomBase exon 1 5662 . - . Parent=transcript:SPAC212.11.1;Name=SPAC212.11.1:exon:1;constitutive=1;ensembl_end_phase=1;ensembl_phase=0;exon_id=SPAC212.11.1:exon:1;rank=1
I PomBase CDS 1 5662 . - 0 ID=CDS:SPAC212.11.1:pep;Parent=transcript:SPAC212.11.1;protein_id=SPAC212.11.1:pep
###
I PomBase chromosome 1 5579133 . . . ID=chromosome:I;Alias=CU329670.1,chr1
###
I PomBase pseudogene 5726 6331 . - . ID=gene:SPAC212.10;Name=SPAC212.10;biotype=pseudogene;gene_id=SPAC212.10;logic_name=pombase
I PomBase pseudogenic_transcript 5726 6331 . - . ID=transcript:SPAC212.10.1;Parent=gene:SPAC212.10;Name=SPAC212.10.1;biotype=pseudogene;transcript_id=SPAC212.10.1
I PomBase exon 5726 6331 . - . Parent=transcript:SPAC212.10.1;Name=SPAC212.10.1:pseudogenic_exon:1;constitutive=1;ensembl_end_phase=0;ensembl_phase=0;exon_id=SPAC212.10.1:pseudogenic_exon:1;rank=1
Thanks for reporting @weir12 . Note that EnsDb databases built from GFF will miss some annotations (because they are not provided in the GFF files) I'd rather suggest to build the object directly from Ensembl with their perl API. Since installing the Perl API and using it is not trivial I could create the EnsDb for you if you want. Just let me know what Ensembl/ensemblgenomes version you need and I'll create it.
@jorainer
Many thanks, to be honest, I just want to get the transcript coordinates corresponding to the genomic coordinates of interest. So I found the function genomeToTranscript of the ensembldb package. Here is the NCBI link about the genome,annotation,transcript I used(https://www.ncbi.nlm.nih.gov/genome/?term=txid4896%5borgn%5d). If I am not mistaken, the genome version is ASM294v2 and the annotation file version is ASM294v2.43 in ensemble.
If there is a more convenient and quick way to solve the problem of coordinate conversion,I am glad to accept that.
thanks again
I've created the EnsDb, you can download it from here.
To use it simply do:
> library(ensembldb)
> edb <- EnsDb("EnsDb.Spombe.v96.sqlite")
> genes(edb)
GRanges object with 7268 ranges and 8 metadata columns:
seqnames ranges strand | gene_id gene_name
<Rle> <IRanges> <Rle> | <character> <character>
SPBC460.01c AB325691 1479-3197 - | SPBC460.01c SPBC460.01c
SPBC460.02c AB325691 8856-9803 - | SPBC460.02c SPBC460.02c
... ... ... ... . ... ...
SPMTR.03 MTR 15593-15721 - | SPMTR.03 mat3-Mm
SPMTR.04 MTR 15856-16401 + | SPMTR.04 mat3-Mc
gene_biotype seq_coord_system
<character> <character>
SPBC460.01c protein_coding chromosome
SPBC460.02c protein_coding chromosome
... ... ...
SPMTR.03 protein_coding chromosome
SPMTR.04 protein_coding chromosome
description
<character>
SPBC460.01c amino-acid permease, unknown [Source:PomBase;Acc:SPBC460.01c]
SPBC460.02c eukaryotic translation elongation factor, glutathione S-transferase (predicted) [Source:PomBase;Acc:SPBC460.02c]
... ...
SPMTR.03 silenced mating-type m-specific polypeptide Mi [Source:PomBase;Acc:SPMTR.03]
SPMTR.04 silenced mating-type m-specific polypeptide Mc [Source:PomBase;Acc:SPMTR.04]
gene_id_version symbol entrezid
<character> <character> <list>
SPBC460.01c SPBC460.01c SPBC460.01c NA
SPBC460.02c SPBC460.02c SPBC460.02c NA
... ... ... ...
SPMTR.03 SPMTR.03 mat3-Mm c(2539637, 3361261)
SPMTR.04 SPMTR.04 mat3-Mc 2540048
-------
seqinfo: 6 sequences from ASM294v2 genome
You can then obviously also use the corrdinate mapping tools - and since it contains protein data you can also map to protein sequence-relative coordinates.
@jorainer Cheers!The problem was solved perfectly.I was really entangled in this problem for a whole day.Sincerely thank you!
Hi Johannes, I face the exact same problem as reported in this thread, but for the organism 'Lottia gigantea'. This organisms does also not exists in the AnnotationHub. Any assistance would be appreciated! Thanks, Guido
http://metazoa.ensembl.org/Lottia_gigantea/Info/Index GFF: ftp://ftp.ensemblgenomes.org/pub/metazoa/release-44/gff3/lottia_gigantea/Lottia_gigantea.Lotgi1.44.gff3.gz
> ensDbFromGff(gff="Lottia_gigantea.Lotgi1.44.gff3.gz", path="./out", organism="Lottia_gigantea", genomeVersion="Lotgi1", version="44")
Importing GFF ... OK
Fixing IDs ... OK
Processing genes ... OK
Processing transcripts ... OK
Processing exons ... OK
-------------
Proceeding to create the database.
Error in connection_connect(dbname, loadable.extensions, flags, vfs) :
Could not connect to database:
unable to open database file
In addition: Warning message:
In `[<-.factor`(`*tmp*`, idx, value = "transcript") :
invalid factor level, NA generated
>
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Thirty)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> traceback()
10: stop(list(message = "Could not connect to database:\nunable to open database file",
call = connection_connect(dbname, loadable.extensions, flags,
vfs), cppstack = list(file = "", line = -1L, stack = c("/usr/lib64/R/library/RSQLite/libs/RSQLite.so(Rcpp::exception::exception(char const*, bool)+0xa7) [0x7eff9101fbc7]",
"/usr/lib64/R/library/RSQLite/libs/RSQLite.so(void Rcpp::stop<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&)+0x54) [0x7eff9102284d]",
"/usr/lib64/R/library/RSQLite/libs/RSQLite.so(DbConnection::DbConnection(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x95) [0x7eff910222d5]",
"/usr/lib64/R/library/RSQLite/libs/RSQLite.so(connection_connect(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x42) [0x7eff910319f2]",
"/usr/lib64/R/library/RSQLite/libs/RSQLite.so(_RSQLite_connection_connect+0x35a) [0x7eff9102ae8a]",
"/usr/lib64/R/lib/libR.so(+0x139ed9) [0x7effa932ced9]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]",
"/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]",
"/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]",
"/usr/lib64/R/lib/libR.so(+0x14a54c) [0x7effa933d54c]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x2fc) [0x7effa933cd7c]",
"/usr/lib64/R/lib/libR.so(+0x142333) [0x7effa9335333]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]",
"/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(R_execMethod+0x258) [0x7effa933faf8]",
"/usr/lib64/R/library/methods/libs/methods.so(+0x53b2) [0x7eff9709f3b2]",
"/usr/lib64/R/lib/libR.so(+0x18e27f) [0x7effa938127f]", "/usr/lib64/R/lib/libR.so(+0x13c80e) [0x7effa932f80e]",
"/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]",
"/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]",
"/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]",
"/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]",
"/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]",
"/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]",
"/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]",
"/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(R_execMethod+0x258) [0x7effa933faf8]",
"/usr/lib64/R/library/methods/libs/methods.so(+0x53b2) [0x7eff9709f3b2]",
"/usr/lib64/R/lib/libR.so(+0x18e27f) [0x7effa938127f]", "/usr/lib64/R/lib/libR.so(+0x13c80e) [0x7effa932f80e]",
"/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]",
"/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]",
"/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]",
"/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]",
"/usr/lib64/R/lib/libR.so(+0x13e1a5) [0x7effa93311a5]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x1a0) [0x7effa933cc20]",
"/usr/lib64/R/lib/libR.so(+0x14b9bf) [0x7effa933e9bf]", "/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x163) [0x7effa933f843]",
"/usr/lib64/R/lib/libR.so(Rf_eval+0x2ba) [0x7effa933cd3a]",
"/usr/lib64/R/lib/libR.so(+0x14e3e7) [0x7effa93413e7]", "/usr/lib64/R/lib/libR.so(Rf_eval+0x58d) [0x7effa933d00d]",
"/usr/lib64/R/lib/libR.so(Rf_ReplIteration+0x1f2) [0x7effa936e6d2]",
"/usr/lib64/R/lib/libR.so(+0x17ba60) [0x7effa936ea60]", "/usr/lib64/R/lib/libR.so(run_Rmainloop+0x50) [0x7effa936eb20]",
"/usr/lib64/R/bin/exec/R(main+0x1f) [0x55b69a5a30af]", "/lib64/libc.so.6(__libc_start_main+0xf3) [0x7effa9023f33]",
"/usr/lib64/R/bin/exec/R(_start+0x2e) [0x55b69a5a30ee]"))))
9: connection_connect(dbname, loadable.extensions, flags, vfs)
8: initialize(value, ...)
7: initialize(value, ...)
6: new("SQLiteConnection", ptr = connection_connect(dbname, loadable.extensions,
flags, vfs), dbname = dbname, flags = flags, vfs = vfs, loadable.extensions = loadable.extensions,
ref = new.env(parent = emptyenv()), bigint = bigint)
5: .local(drv, ...)
4: dbConnect(dbDriver("SQLite"), dbname = dbname)
3: dbConnect(dbDriver("SQLite"), dbname = dbname)
2: ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile,
genomeVersion = genFromFile, version = ensFromFile, ...)
1: ensDbFromGff(gff = "Lottia_gigantea.Lotgi1.44.gff3.gz", path = "./out",
organism = "Lottia_gigantea", genomeVersion = "Lotgi1", version = "44")
Hi @guidohooiveld , I'll create the EnsDb. I'll let you know when it is ready and from where you can get it.
The EnsDb (Ensembl 97/Ensemblgenomes 44) is available here
Wow, within 1 hr a working database... Thanks very much Johannes, much appreciated!
Let me know if you need an updated version or other too.
Thanks for reporting @stephenrong. My quick guess is that there has been a change in the gff file format from Ensembl. Is there any reason you're building the
EnsDbfrom the GFF instead of using the official ones fromAnnotationHub?
I am struggling with a similar issue. But 'AnnotationHub' doesn't contain information for Pseudomonas aeruginosa
@Olakmephd, bacteria are a little more difficult. Ensembl does not provide a single database for each bacteria, but has databases for a collection of bacteria. I did not yet find a good way to extract the data for a single bacteria from such a database using ensembldb (and the Ensembl Perl API).
You could however download the GFF file for this bacteria e.g. from:
ftp://ftp.ensemblgenomes.org/pub/release-53/bacteria/gff3/bacteria_98_collection/pseudomonas_aeruginosa_gca_003325605 (although I don't know if that's the actuall strain you are looking for) and then create the EnsDb using
> library(ensembldb)
> db <- ensDbFromGff("Pseudomonas_aeruginosa_gca_003325605.ASM332560v1.49.gff3.gz")
Importing GFF ... OK
Fixing IDs ... OK
Processing genes ... OK
Processing transcripts ... OK
Processing exons ... OK
-------------
Proceeding to create the database.
Processing genes ...
Attribute availability:
o gene_id ... OK
o gene_name ... OK
o entrezid ... Nope
o gene_biotype ... OK
OK
Processing transcripts ...
Attribute availability:
o transcript_id ... OK
o gene_id ... OK
o transcript_biotype ... OK
o transcript_name ... Nope
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL
OK
Processing metadata ... OK
Generating index ... OK
-------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
Warning messages:
1: In `[<-.factor`(`*tmp*`, idx, value = "transcript") :
invalid factor level, NA generated
2: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, :
I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
3: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, :
I'm missing column(s): 'transcript_name'. The corresponding database columns will be empty!
4: In .getSeqlengthsFromMysqlFolder(organism = organism, ensembl = ensemblVersion, :
Can not get the sequence lengths from Ensembl or Ensemblgenomes. Seqinfo will lack the sequence lengths.
5: In tryGetSeqinfoFromEnsembl(organism, version, seqnames = chroms$seq_name) :
Unable to retrieve sequence lengths from Ensembl.
You can ignore the warning messages. Also, the function fails to retrieve the correct chromosome lengths from Ensembl, so you'll not have that information in the database. You can then load this databases with:
> edb <- EnsDb(db)
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.0.1
|Creation time: Mon May 16 06:06:51 2022
|ensembl_version: 49
|ensembl_host: unknown
|Organism: Pseudomonas_aeruginosa_gca_003325605
|genome_build: ASM332560v1
|DBSCHEMAVERSION: 1.0
|source_file: Pseudomonas_aeruginosa_gca_003325605.ASM332560v1.49.gff3.gz
| No. of genes: 8770.
| No. of transcripts: 8770.
> organism(edb)
[1] "Pseudomonas aeruginosa gca 003325605"
@jorainer I am really sorry to ask you but I have been trying to make EnsDB for "Trichoderma reesei RUT-C30", however, I am getting the following warning messages. Could you please look into it and kindly assist me?
Thanks `> gfffile <- "Trichoderma_reesei_rut_c_30_gca_000513815.TrireRUTC30_1.56.gff3.gz"
DB <- ensDbFromGff(gff = gfffile) Importing GFF ... OK Fixing IDs ... OK Processing genes ... OK Processing transcripts ... OK Processing exons ... OK Proceeding to create the database. Processing genes ... Attribute availability: o gene_id ... OK o gene_name ... OK o entrezid ... Nope o gene_biotype ... OK OK Processing transcripts ... Attribute availability: o transcript_id ... OK o gene_id ... OK o transcript_biotype ... OK o transcript_name ... Nope OK Processing exons ... OK Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL OK Processing metadata ... OK Generating index ... OK Verifying validity of the information in the database: Checking transcripts ... OK Checking exons ... OK Warning messages: 1: In
[<-.factor(*tmp*, idx, value = "transcript") : invalid factor level, NA generated 2: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, : I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty! 3: In ensDbFromGRanges(theGff, outfile = outfile, path = path, organism = orgFromFile, : I'm missing column(s): 'transcript_name'. The corresponding database columns will be empty! 4: In .getSeqlengthsFromMysqlFolder(organism = organism, ensembl = ensemblVersion, : Can not get the sequence lengths from Ensembl or Ensemblgenomes. Seqinfo will lack the sequence lengths. 5: In tryGetSeqinfoFromEnsembl(organism, version, seqnames = chroms$seq_name) : Unable to retrieve sequence lengths from Ensembl. `
@IrshadUlHaq1 , the warnings just tell you that some of the content is not available in the EnsDb you created, because it is not stored/available in the GFF file. The EnsDb should however still be usable.
@jorainer, thank you for your input. The generated 'edb' is usable, however, my end goal is to extract 'entrez' ID for a subset of the 'gene_id'. I need 'entrez' to perform pathway analyses. While I am aware that the 'gff' file does not contain 'entrez' info, I wonder if you have other suggestion for me? I tried to generate the 'edb' using Ensembl Perl API but I ran into problems on my Arch linux.
On another note, could the 'gene_id' be converted to 'entrez' id's by a different package such as 'ClusterProfiler'? Also, the 'gene_id' in the 'edb' are 'stable IDs' and I having a hard time to understand the differences here.
I appreciate your help and time. Again, thank you.
Unfortunately the scripts in ensembldb to create an EnsDb using the Ensembl Perl API don't work properly for bacteria, mostly because Ensembl does not provide a database for a single species/strain (they are bundled in "collections").
I guess it should be possible to convert the IDs in the EnsDb to Entrez IDs - maybe using biomaRt? Or any other package or resource that provides annotations for your species.
The gene_id in EnsDb is usually the Ensembl gene ID (ENSG...), but it depends what identifiers the GFF provides. It will be the gene identifier provided within the GFF file.