phylophlan
Cannot download all the reference genomes.
Dear Developers,
When I download the reference genomes using the command "phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose 2>&1 | tee logs/phylophlan_get_reference.log", the download usually stops before it has finished.
Downloading file of size: 1.16 MB 1.16 MB 100.06 % 0.48 MB/sec 0 min -0 sec
Downloading 1 reference genomes for k__Bacteria|p__Candidatus_Veblenbacteria|c__Candidatus_Veblenbacteria_unclassified|o__Candidatus_Veblenbacteria_unclassified|f__Candidatus_Veblenbacteria_unclassified|g__Candidatus_Veblenbacteria_unclassified|s__Candidatus_Veblenbacteria_bacterium_RIFOXYC2_FULL_42_11 Downloading "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/822/985/GCA_001822985.1_ASM182298v1/GCA_001822985.1_ASM182298v1_genomic.fna.gz" to "input_genomes/GCA_001822985.fna.gz" Downloading file of size: 0.23 MB 0.23 MB 100.10 % 0.17 MB/sec 0 min -0 sec
Downloading 1 reference genomes for k__Bacteria|p__Candidatus_Veblenbacteria|c__Candidatus_Veblenbacteria_unclassified|o__Candidatus_Veblenbacteria_unclassified|f__Candidatus_Veblenbacteria_unclassified|g__Candidatus_Veblenbacteria_unclassified|s__Candidatus_Veblenbacteria_bacterium_RIFOXYD1_FULL_43_11 Downloading "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/823/015/GCA_001823015.1_ASM182301v1/GCA_001823015.1_ASM182301v1_genomic.fna.gz" to "input_genomes/GCA_001823015.fna.gz" Downloading file of size: 0.19 MB 0.05 MB 24.20 % 0.04 MB/sec 0 min 4 sec
I do not know whether this is because the connection with NCBI dropped or for some other reason. What can I do about it?
Hi, I think it might be due to some connection instability. I re-ran the command you posted this morning and it is still running (downloading genomes from NCBI). Are you able to try with a different Internet connection?
Thanks, Francesco
Dear Francesco,
Thank you very much for your reply! I tried many times and the result is always similar. So I am wondering whether I can download the remaining genomes one by one. Is it possible to get the list of reference genomes? Are all the ones from "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA" needed?
Yes, you can play a bit with bash and the files downloaded by PhyloPhlAn at the beginning: taxa2genomes_cpa0.2_up201804.txt.bz2 and assembly_summary_genbank.txt.
- For each line in taxa2genomes_cpa0.2_up201804.txt.bz2, you should consider the first item of the list (";" separated) of the third field (TAB separated)
- The ID from the previous step is in the form GCA_001905625.1; you should split it on the "." and keep only the first part (i.e., GCA_001905625)
- Then you should get the ftp_path from the assembly_summary_genbank.txt that matches the previous ID to get the URL for downloading
- From the URL retrieved from the assembly_summary_genbank.txt file, you should replace ftp:// with https:// and append _genomic.fna.gz to the end
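The steps above could also be sketched in Python instead of bash. This is only a rough sketch under some assumptions, not PhyloPhlAn's own code: the function names are made up for illustration, the position of the ftp_path column is read from the header line of assembly_summary_genbank.txt (falling back to an assumed column 20), and the final URL repeats the last component of ftp_path as the file name, as in the URLs printed in the log earlier in this thread. It only prints the URL and target file for each genome not yet present in input_genomes/, so the actual download can be done with any tool.

```python
import bz2
import os


def first_genome_id(line):
    """Return the first genome ID of a taxa2genomes line, without its version suffix."""
    if line.startswith("#"):
        return None  # assumption: lines starting with '#' are comments
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3 or not fields[2]:
        return None  # malformed or empty line
    first = fields[2].split(";")[0].strip()  # first item of the ';' separated list
    return first.split(".")[0] or None       # GCA_001905625.1 -> GCA_001905625


def load_ftp_paths(lines):
    """Map each accession without version (e.g. GCA_001905625) to its ftp_path."""
    ftp_col = 19  # assumption: ftp_path is the 20th column if no header names it
    paths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#"):
            if "ftp_path" in fields:
                ftp_col = fields.index("ftp_path")
            continue
        if len(fields) > ftp_col and fields[ftp_col] not in ("", "na"):
            # if an accession appears with several versions, the last one wins
            paths[fields[0].split(".")[0]] = fields[ftp_col]
    return paths


def build_url(ftp_path):
    """Turn an ftp_path into the https URL of the *_genomic.fna.gz file."""
    base = ftp_path.rstrip("/")
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    return base + "/" + base.rsplit("/", 1)[-1] + "_genomic.fna.gz"


def pending_downloads(taxa_lines, ftp_paths, out_dir):
    """Yield (url, target_file) for genomes whose target file does not exist yet."""
    for line in taxa_lines:
        genome_id = first_genome_id(line)
        if genome_id is None or genome_id not in ftp_paths:
            continue
        target = os.path.join(out_dir, genome_id + ".fna.gz")
        if not os.path.exists(target):
            yield build_url(ftp_paths[genome_id]), target


if __name__ == "__main__":
    summary_file = "assembly_summary_genbank.txt"
    taxa_file = "taxa2genomes_cpa0.2_up201804.txt.bz2"
    if os.path.exists(summary_file) and os.path.exists(taxa_file):
        with open(summary_file, encoding="utf-8") as summary:
            ftp_paths = load_ftp_paths(summary)
        with bz2.open(taxa_file, "rt") as taxa:
            for url, target in pending_downloads(taxa, ftp_paths, "input_genomes"):
                print(url + "\t" + target)
    else:
        print("Run this in the folder containing the two PhyloPhlAn files.")
```

Note that a file left truncated by an interrupted download still exists on disk, so it would be skipped here; such files should be deleted first.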
Thank you very much for your reply. To download the large number of remaining genomes, can I edit the file "taxa2genomes_cpa0.2_up201804.txt.bz2" and cut out the entries for the genomes that were already downloaded? I tried this, but it does not work:
(python3.7) [wl@ts-rd350 Phylophlan]$ phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose 2>&1 | tee logs/phylophlan_get_reference.log
phylophlan_get_reference.py version 3.0.16 (8 May 2020)
Command line: /home/wl/.conda/envs/python3.7/bin/phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose
Arguments: {'get': 'all', 'list_clades': False, 'database_update': False, 'output_file_extension': '.fna.gz', 'output': 'input_genomes/', 'how_many': 1, 'genbank_mapping': 'assembly_summary_genbank.txt', 'verbose': True}
File "taxa2genomes.txt" present
File "taxa2genomes_cpa0.2_up201804.txt.bz2" present
Output folder "input_genomes/" present
File "assembly_summary_genbank.txt" present
Traceback (most recent call last):
  File "/home/wl/.conda/envs/python3.7/bin/phylophlan_get_reference", line 10, in
    sys.exit(phylophlan_get_reference())
  File "/home/wl/.conda/envs/python3.7/lib/python3.7/site-packages/phylophlan/phylophlan_get_reference.py", line 313, in phylophlan_get_reference
    args.output_file_extension, args.output, args.database_update, verbose=args.verbose)
  File "/home/wl/.conda/envs/python3.7/lib/python3.7/site-packages/phylophlan/phylophlan_get_reference.py", line 274, in get_reference_genomes
    if (taxa_label in r_clean[1].split('|')) or (taxa_label == 'all'):
IndexError: list index out of range
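The IndexError on r_clean[1] suggests that at least one line of the edited file no longer has the expected fields. That is a guess from the traceback alone, assuming r_clean is a line split on TAB. A small check like the following might help locate such lines, for example blank lines or lines where the TABs were turned into spaces while editing. The function name and the requirement of three fields are assumptions made for illustration; the latter comes from the "third field" mentioned earlier in this thread.

```python
import bz2
import os


def malformed_lines(lines, min_fields=3):
    """Yield (line_number, text) for lines with fewer than min_fields TAB-separated fields."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if text.startswith("#"):
            continue  # assumption: lines starting with '#' are comments
        if len(text.split("\t")) < min_fields:
            yield number, text


if __name__ == "__main__":
    taxa_file = "taxa2genomes_cpa0.2_up201804.txt.bz2"
    if os.path.exists(taxa_file):
        with bz2.open(taxa_file, "rt") as handle:
            for number, text in malformed_lines(handle):
                print("line " + str(number) + ": " + repr(text))
```

If this prints nothing, the cause is probably elsewhere, for instance in the uncompressed "taxa2genomes.txt" that the log also reports as present.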