
Cannot download all the reference genomes.

Open Lily-WL opened this issue 4 years ago • 4 comments

Dear Developers,

When I download the reference genomes using the command "phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose 2>&1 | tee logs/phylophlan_get_reference.log", the download usually stops before it has finished.

Downloading file of size: 1.16 MB
1.16 MB 100.06 % 0.48 MB/sec 0 min -0 sec
Downloading 1 reference genomes for k__Bacteria|p__Candidatus_Veblenbacteria|c__Candidatus_Veblenbacteria_unclassified|o__Candidatus_Veblenbacteria_unclassified|f__Candidatus_Veblenbacteria_unclassified|g__Candidatus_Veblenbacteria_unclassified|s__Candidatus_Veblenbacteria_bacterium_RIFOXYC2_FULL_42_11
Downloading "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/822/985/GCA_001822985.1_ASM182298v1/GCA_001822985.1_ASM182298v1_genomic.fna.gz" to "input_genomes/GCA_001822985.fna.gz"
Downloading file of size: 0.23 MB
0.23 MB 100.10 % 0.17 MB/sec 0 min -0 sec
Downloading 1 reference genomes for k__Bacteria|p__Candidatus_Veblenbacteria|c__Candidatus_Veblenbacteria_unclassified|o__Candidatus_Veblenbacteria_unclassified|f__Candidatus_Veblenbacteria_unclassified|g__Candidatus_Veblenbacteria_unclassified|s__Candidatus_Veblenbacteria_bacterium_RIFOXYD1_FULL_43_11
Downloading "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/823/015/GCA_001823015.1_ASM182301v1/GCA_001823015.1_ASM182301v1_genomic.fna.gz" to "input_genomes/GCA_001823015.fna.gz"
Downloading file of size: 0.19 MB
0.05 MB 24.20 % 0.04 MB/sec 0 min 4 sec

I do not know whether this is because the connection with NCBI dropped or for some other reason. What can I do about it?

Lily-WL avatar Jul 07 '20 01:07 Lily-WL

Hi, I think it might be due to some connection instability. I re-ran the command you posted this morning and it is still running (downloading genomes from NCBI). Are you able to try with a different Internet connection?

Thanks, Francesco

fasnicar avatar Jul 07 '20 09:07 fasnicar

Dear Francesco,

Thank you very much for your reply! I tried many times, and the result is always similar. So I am wondering whether I can download the remaining genomes one by one. Is it possible to get the list of reference genomes? Are all the ones from "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA" needed?

Lily-WL avatar Jul 08 '20 01:07 Lily-WL

Yes, you can play a bit with bash and the files downloaded by PhyloPhlAn at the beginning: taxa2genomes_cpa0.2_up201804.txt.bz2 and assembly_summary_genbank.txt.

  1. For each line in taxa2genomes_cpa0.2_up201804.txt.bz2 you should consider the first item of the list (; separated) of the third field (TAB separated)
  2. The ID from the previous step is in the form GCA_001905625.1, you should split it on the . and keep only the first part (i.e., GCA_001905625)
  3. Then you should get the ftp_path from the assembly_summary_genbank.txt that matches the previous ID to get the URL for downloading
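  4. In the URL retrieved from the assembly_summary_genbank.txt file, you should replace ftp:// with https:// and append the last path component followed by _genomic.fna.gz (as in the log above: .../GCA_001822985.1_ASM182298v1/GCA_001822985.1_ASM182298v1_genomic.fna.gz)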
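
The steps above could be sketched in Python roughly as follows. This is only an illustration and not part of PhyloPhlAn: the function names are hypothetical, and it assumes that the third field holds bare IDs as described in step 1, that lines starting with # are comments, and that assembly_summary_genbank.txt has the usual NCBI layout (versioned accession in the first column and a header line naming the ftp_path column, with the 20th column as a fallback). The final URL repeats the last path component, following the URLs printed in the download log earlier in this thread.

```python
import bz2


def genome_ids(taxa2genomes_lines):
    """Steps 1-2: yield the unversioned ID of the first genome on each line."""
    for line in taxa2genomes_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: '#' lines are comments; lines without a third field are skipped.
        if line.startswith("#") or len(fields) < 3 or not fields[2]:
            continue
        first = fields[2].split(";")[0]   # first item of the ';'-separated list
        yield first.split(".")[0]         # GCA_001905625.1 -> GCA_001905625


def ftp_paths(assembly_summary_lines):
    """Step 3: map each unversioned accession to its ftp_path."""
    column = 19  # assumed position of ftp_path when no header line names it
    paths = {}
    for line in assembly_summary_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#"):
            names = [name.lstrip("# ").strip() for name in fields]
            if "ftp_path" in names:
                column = names.index("ftp_path")
            continue
        if len(fields) > column and fields[column] not in ("", "na"):
            paths[fields[0].split(".")[0]] = fields[column]
    return paths


def download_url(ftp_path):
    """Step 4: build the https URL of the *_genomic.fna.gz file."""
    base = ftp_path.rstrip("/")
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    return "{}/{}_genomic.fna.gz".format(base, base.rsplit("/", 1)[-1])


def list_urls(taxa2genomes_path, assembly_summary_path):
    """Return one download URL per line of the taxa2genomes file, where found."""
    with open(assembly_summary_path, encoding="utf-8") as summary:
        paths = ftp_paths(summary)
    with bz2.open(taxa2genomes_path, "rt") as taxa:
        return [download_url(paths[g]) for g in genome_ids(taxa) if g in paths]
```

Calling list_urls("taxa2genomes_cpa0.2_up201804.txt.bz2", "assembly_summary_genbank.txt") would then give a list of URLs that can be fetched one at a time with any download tool, skipping the files already present in input_genomes/.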

fasnicar avatar Jul 08 '20 12:07 fasnicar

Thank you very much for your reply. In order to download the large number of remaining genomes, can I edit the file "taxa2genomes_cpa0.2_up201804.txt.bz2" and cut out the entries for the genomes that were already downloaded? I tried this, but it does not work.

(python3.7) [wl@ts-rd350 Phylophlan]$ phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose 2>&1 | tee logs/phylophlan_get_reference.log
phylophlan_get_reference.py version 3.0.16 (8 May 2020)

Command line: /home/wl/.conda/envs/python3.7/bin/phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose

Arguments: {'get': 'all', 'list_clades': False, 'database_update': False, 'output_file_extension': '.fna.gz', 'output': 'input_genomes/', 'how_many': 1, 'genbank_mapping': 'assembly_summary_genbank.txt', 'verbose': True}
File "taxa2genomes.txt" present
File "taxa2genomes_cpa0.2_up201804.txt.bz2" present
Output folder "input_genomes/" present
File "assembly_summary_genbank.txt" present
Traceback (most recent call last):
  File "/home/wl/.conda/envs/python3.7/bin/phylophlan_get_reference", line 10, in <module>
    sys.exit(phylophlan_get_reference())
  File "/home/wl/.conda/envs/python3.7/lib/python3.7/site-packages/phylophlan/phylophlan_get_reference.py", line 313, in phylophlan_get_reference
    args.output_file_extension, args.output, args.database_update, verbose=args.verbose)
  File "/home/wl/.conda/envs/python3.7/lib/python3.7/site-packages/phylophlan/phylophlan_get_reference.py", line 274, in get_reference_genomes
    if (taxa_label in r_clean[1].split('|')) or (taxa_label == 'all'):
IndexError: list index out of range

Lily-WL avatar Jul 09 '20 08:07 Lily-WL
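
The IndexError on r_clean[1] in the traceback above means that some line of the edited file yielded fewer than two fields. One way to look for such lines is sketched below. It rests on assumptions: that r_clean comes from splitting each line on TABs (the traceback does not show this), that lines starting with # are comments, and that a valid line has at least the three TAB-separated fields described earlier in this thread. The function names are hypothetical and not part of PhyloPhlAn.

```python
import bz2


def malformed_lines(lines, min_fields=3):
    """Return (line_number, text) for non-comment lines with too few TAB fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):  # assumption: '#' marks a comment line
            continue
        if len(line.strip().split("\t")) < min_fields:
            bad.append((number, line.rstrip("\n")))
    return bad


def check_file(path="taxa2genomes_cpa0.2_up201804.txt.bz2"):
    """Scan the (edited) bz2 file and report suspicious lines."""
    with bz2.open(path, "rt") as handle:
        return malformed_lines(handle)
```

Blank lines left behind after cutting entries, or TABs turned into spaces by an editor, would both be reported by this check.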