biomartr
Issue after Recovering
Hi Hajk, I want to use meta.retrieval to get bacterial proteomes from ensemblgenomes. There were some connection problems on my Mac at first. I then tried to resume the job on another machine running Ubuntu, but I got an error:
biomartr::meta.retrieval(kingdom = "EnsemblBacteria", db = "ensemblgenomes", type = "proteome")
Starting meta retrieval of all proteome files for kingdom: EnsemblBacteria from database: ensemblgenomes.
Skipping already downloaded species: Acetobacterium_bakii, Acetobacter_orleanensis_jcm_7639, Acetobacter_pasteurianus_386b, Acetobacter_pasteurianus_subsp_ascendens, Acetobacter_pomorum, Acholeplasma_oculi, Achromatium_sp_wms1, Achromobacter_arsenitoxydans_sy8, Achromobacter_piechaudii_hle, Achromobacter_sp_dms1, Achromobacter_sp_root83, Achromobacter_sp_rta, Achromobacter_xylosoxidans_a8
Starting proteome retrieval of 'Chryseobacterium sp hurlbut01' from ensemblgenomes ...
Error: The accession id or taxid you used did not have a reference genome in the ENSEMBLGENOMES database. Please check is.genome.available() to retrieve accession ids and taxids of available genome assemblies.
In addition: Warning message:
In (which_found - 1):length(.data) :
  numerical expression has 2 elements: only the first used
I can't figure out the problem.
One more thing: every time I resume an unfinished retrieval, it seems to take quite a long time to scan the files from the previous job. Could the program simply skip them and continue? That would save a lot of time.
By the way, I like your package so much. Please keep updating!
Best regards, Chi
Hi @3719left,
Thank you so much for making me aware of this issue.
The reason why it might have stopped working could be due to the naming convention for bacteria. Unfortunately, the scientific naming for bacteria is still a wild west, because they include strain numbers, strain names, dots, (, [, etc without any standard or convention (this is also true for viruses). This is very hard to standardize and so whenever I build a data retrieval query using the scientific name, sometimes the API doesn't recognize it and thus the retrieval stops.
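To illustrate why such names are hard to match, here is a minimal sketch of one way to reduce free-form bacterial names to a predictable form. The helper `normalize_name()` is hypothetical and is not biomartr's actual backend code; the lower-case, underscore-separated form is only assumed from the species names shown in the output above.

```r
# Hypothetical helper, not part of biomartr: collapse every run of characters
# that are not letters or digits into a single underscore.
normalize_name <- function(x) {
  x <- gsub("[^A-Za-z0-9]+", "_", x)  # brackets, dots, spaces, hyphens -> "_"
  x <- gsub("^_+|_+$", "", x)         # trim leading and trailing underscores
  tolower(x)
}

normalize_name("Chryseobacterium sp. Hurlbut01")
normalize_name("Escherichia coli str. K-12 substr. MG1655")
normalize_name("[Clostridium] scindens ATCC 35704")
```

A normalization like this is lossy: two different strains can collapse to the same string, which is one reason a name-based query can fail or pick the wrong entry.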
I have now tried to overcome this limitation, at least partly, by revising my retrieval backend, which should now work for your command:
biomartr::meta.retrieval(kingdom = "EnsemblBacteria", db = "ensemblgenomes", type = "proteome")
In the next days, I will also revise all other types such as genome, cds, etc.
Please let me know if it works for you now using the developer version of biomartr:
source("http://bioconductor.org/biocLite.R")
biocLite("ropensci/biomartr")
Regarding your request for restarting retrieval jobs and where they pick up again:
The reason why I start over again is to check whether the corresponding fasta file has actually been retrieved, by checking the md5 checksum of the file. It could be that the function stopped at some point and didn't actually retrieve the entire fasta file. Thus, the folder name and file might exist, but the corresponding fasta file might actually be corrupted. To make sure that this doesn't happen, I start crawling again from the beginning and check the md5 checksum of the already retrieved files. If it is useful for you anyway, I can implement an argument restart_at_last which can be set to TRUE or FALSE; with restart_at_last = TRUE, the retrieval will resume at the last position only. In that case, please be aware that some fasta files might be corrupted due to the initial loss of connection.
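The check described above can be sketched as follows. This is a minimal illustration, assuming the expected md5 hash is already known as a string; `file_is_intact()` is a hypothetical helper and not biomartr's internal code. It relies only on `tools::md5sum()`, which ships with R.

```r
# Hypothetical helper, not biomartr's internal code: TRUE only if the file
# exists and its md5 hash equals the expected one.
file_is_intact <- function(path, expected_md5) {
  if (!file.exists(path)) return(FALSE)
  actual <- unname(tools::md5sum(path))
  identical(tolower(actual), tolower(expected_md5))
}

# Demonstration with a temporary file standing in for a downloaded fasta file
tmp <- tempfile(fileext = ".faa")
writeLines(c(">protein_1", "MKTAYIAKQR"), tmp)
good_md5 <- unname(tools::md5sum(tmp))

file_is_intact(tmp, good_md5)          # TRUE: file matches its recorded hash
cat("MKT", file = tmp, append = TRUE)  # simulate a file altered after hashing
file_is_intact(tmp, good_md5)          # FALSE: hash no longer matches
```

A file that exists but fails this check would be downloaded again, which is why existing folders alone are not enough to skip a species safely.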
I hope this helps :)
Many thanks and kind regards, Hajk
Hi HajkD, Thanks for the debugging and the explanation. I tried again and it works (it took 2.5 to finish). I successfully downloaded 25 Gb of protein data; before this, it stopped after 2.4 Gb. And thanks for the restart argument. I will use it wisely.
About the "retrieval query using the scientific name": how does it work? Is there a name list that you update manually for each database? Or does the API extract the species names from each database's website automatically every time we run biomartr?
By the way, meta.retrieval() from RefSeq didn't work:
Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/' cannot be reached. Are you connected to the internet? Is the the FTP site 'NA/NA_protein.faa.gz' currently available?
Is it the same problem, or did their FTP just fail?
Best regards Chi
Hi @3719left,
I am happy that it now works for you :)
Regarding the NCBI RefSeq retrieval:
Could you please provide me with an example that doesn't work after installing the most recent developer version?
Over the past few days I have almost completely remodeled the retrieval backend, so something might still be broken. So far, my example of a bacterial meta.retrieval() from NCBI RefSeq has worked seamlessly.
Regarding "retrieval query using the scientific name":
- All queries are automated and use a json application request to the NCBI/ENSEMBL APIs. So no, nothing is done manually. However, there are inconsistencies in the API return values and the server folder structures at NCBI and ENSEMBL that I need to work around. I will make the developers there aware of the issue, but this is the reason why certain scientific names are not supported.
I hope this helps!
Best wishes, Hajk
Hi Hajk, Thanks for your reply. Here's a new error:
biomartr::meta.retrieval(kingdom = "EnsemblBacteria", db = "ensemblgenomes", type = "genome")
Starting meta retrieval of all genome files for kingdom: EnsemblBacteria from database: ensemblgenomes.
Starting genome retrieval of 'GCA_001187685.1' from ensemblgenomes ...
Error: The API 'http://rest.ensemblgenomes.org' does not seem to work properly. Are you connected to the internet? Is the homepage 'http://rest.ensemblgenomes.org' currently available?
Basically I updated to the newest developer version before everything. Then I did the metaretrieval and I got the error. I checked the API link and it is available.
One more thing: I noticed that you added a new function, getCollection(), and it is fabulous. It made me wonder whether it could work in a batch manner. For example, I previously did a meta-retrieval of all proteomes from the EnsemblBacteria database and then, through an HMMER search, found all the proteins I was interested in. Now I want to extract the whole-genome data for them, which is about 970 species in various subgroups of bacteria. Obviously I don't want to retrieve them one after another, so I have to download the entire EnsemblBacteria genome set and extract those 970 from it. If the "organism" argument of getCollection could read a list (scientific names, accession ids or taxonomic identifiers), that would be perfect.
Oh, I almost forgot to mention that I just tried meta.retrieval() from NCBI RefSeq. It ran very well for the first 10 minutes, then:
|=================================================================| 100% 34 MB
Proteome download is completed!
Checking md5 hash of file: bacteria/Isosphaera_pallida_ATCC_43644_md5checksums.txt ...
The md5 hash of file 'bacteria/Isosphaera_pallida_ATCC_43644_md5checksums.txt' matches!
The proteome of 'Isosphaera_pallida_ATCC_43644' has been downloaded to 'bacteria' and has been named 'Isosphaera_pallida_ATCC_43644_protein_refseq.faa.gz' .
Starting proteome retrieval of 'Methylocystis parvus OBBP' from refseq ...
|=================================================================| 100% 34 MB
Error: The FTP site 'ftp://ftp.ncbi.nlm.nih.gov/' cannot be reached. Are you connected to the internet? Is the the FTP site 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/283/235/GCF_000283235.1_MetPar_1.0/GCF_000283235.1_MetPar_1.0_protein.faa.gz' currently available?
It looks like the same problem and that ftp link is available. I guess that meta-retrieval triggers the FTP connection limits.
Anyway, I appreciate your effort and am looking forward to your reply.
Best regards, Chi
Hi Chi,
Thank you once more very much for making me aware of these issues.
It is strange that you get errors with biomartr::meta.retrieval(kingdom = "EnsemblBacteria", db = "ensemblgenomes", type = "genome"), because when I run it here on my iMac it works seamlessly. Just to make sure that the most recent version of biomartr is installed on your machine, could you please send me the sessionInfo() of your system?
Does the command start with the message "Starting retrieval of information for all species stored in ENSEMBLGENOMES... This needs to be done only once."?
Regarding your other points:
I am planning on introducing a new set of functions named getGenomeSet(), getProteomeSet(), etc to be able to specify a vector containing scientific names, taxids, etc to batch download genomes, proteomes, collections, etc for a set of organisms, so this might be exactly the functionality you asked for :)
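Until such functions exist, a batch download over a vector of organisms can be approximated with a small wrapper. This is only a sketch under stated assumptions: `retrieve_set()` and `fake_download()` are hypothetical names, not the announced getGenomeSet(), and the commented call assumes that biomartr::getGenome() accepts `db` and `organism` arguments.

```r
# Hypothetical wrapper, not the announced getGenomeSet(): apply a per-organism
# download function to a vector of names and keep going when one of them fails.
retrieve_set <- function(organisms, download_fun, ...) {
  results <- lapply(organisms, function(org) {
    tryCatch(
      download_fun(organism = org, ...),
      error = function(e) {
        message("Skipping '", org, "': ", conditionMessage(e))
        NA_character_
      }
    )
  })
  names(results) <- organisms
  results
}

# With biomartr installed this might be called as (assumed usage, not run here;
# `my_organisms` is a placeholder character vector):
# paths <- retrieve_set(my_organisms, biomartr::getGenome, db = "refseq")

# Self-contained demonstration with a stand-in downloader
fake_download <- function(organism, db) {
  if (organism == "unknown species") stop("no genome available")
  file.path(db, paste0(gsub(" ", "_", organism), ".fa.gz"))
}
paths <- retrieve_set(c("Acholeplasma oculi", "unknown species"),
                      fake_download, db = "ensemblgenomes")
```

Entries that come back as NA mark organisms whose retrieval failed, so they can be retried or looked up by accession id instead of by name.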
For the other issue with the NCBI RefSeq FTP retrieval, all I can say is that NCBI seems to limit the number of query requests, and there is nothing I can do about this :( All I can recommend is to restart the command, and it will pick up from where it left off. If it still doesn't work, simply re-run the command the next day. As far as I have observed, the NCBI query ban usually lasts only a few hours.
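Re-running by hand can be automated with a simple retry loop. The sketch below is an illustration only: `with_retry()` is a hypothetical helper, not a biomartr function, and the waiting times are arbitrary assumptions rather than documented NCBI limits.

```r
# Hypothetical helper: call `fun` until it succeeds or `max_tries` is reached,
# waiting a little longer after each failed attempt.
with_retry <- function(fun, max_tries = 3, wait_seconds = 60) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait_seconds * attempt)
  }
  stop("All ", max_tries, " attempts failed; try again later.")
}

# Assumed usage with biomartr (not run here):
# with_retry(function() biomartr::meta.retrieval(kingdom = "bacteria",
#                                                db = "refseq", type = "proteome"),
#            wait_seconds = 3600)

# Self-contained demonstration: a function that fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("FTP site cannot be reached")
  "done"
}
outcome <- with_retry(flaky, max_tries = 5, wait_seconds = 0)
```

Because the retrieval resumes from where it stopped, each retry only has to fetch what is still missing.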
Please let me know if this helps.
Kind regards, Hajk
Hi Hajk, The commands were actually run on an Ubuntu machine. Maybe that's the reason it doesn't work so well?
sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.0
It does start with the species retrieval message.
Those upcoming functions are inspiring. Thanks for your continuous maintenance and updating.
I will wait until a lucky day and try the RefSeq retrieval.
This definitely helps a lot.
Best regards,
Chi
Hi Chi,
Perfect. I will have a look at it and will come back to you.
Cheers, Hajk
Hi Chi,
This should work now seamlessly with:
biomartr::meta.retrieval(kingdom = "bacteria", db = "ensembl", type = "genome")
I hope this helps!
Cheers, Hajk