biomartr Point to "assembly_summary.txt" instead of downloading files

Hello! I am using snakemake to parallelize the download of ~100k genomes from RefSeq using biomartr. To do so, I run one R session per genome. At the beginning of each download, I get the following message:

It seems that this is the first time you run this command for refseq.
Thus, 'assembly_summary.txt' files for all kingdoms will be retrieved from refseq.
Don't worry this has to be done only once if you don't restart your R session.


trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/assembly_summary.txt'
Content type 'unknown' length 343595 bytes (335 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt'
Content type 'unknown' length 56089025 bytes (53.5 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/assembly_summary.txt'
Content type 'unknown' length 110634 bytes (108 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/invertebrate/assembly_summary.txt'
Content type 'unknown' length 66722 bytes (65 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/assembly_summary.txt'
Content type 'unknown' length 38977 bytes (38 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/protozoa/assembly_summary.txt'
Content type 'unknown' length 29402 bytes (28 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/assembly_summary.txt'
Content type 'unknown' length 44481 bytes (43 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_other/assembly_summary.txt'
Content type 'unknown' length 61251 bytes (59 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt'
Content type 'unknown' length 2561281 bytes (2.4 MB)
==================================================

Since I'm exiting my R session after downloading each genome of interest, these assembly_summary.txt files will be downloaded 100k separate times. Is it possible to download them once and point biomartr to their location on my hard drive? Or to have them downloaded only if they are not already at a specific location on my hard drive?
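
To make the second option concrete, here is a minimal Python sketch of the "download only if missing" pattern I have in mind. This is not biomartr's API; `fetch_once` and its `download` parameter are hypothetical names used only for illustration. The file is written to a temporary name first and then moved into place, so parallel jobs should never see a half-written file.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def _download(url, dest):
    """Default downloader: fetch url to dest with the standard library."""
    urlretrieve(url, str(dest))


def fetch_once(url, target, download=_download):
    """Return target, downloading url to it only if it does not exist yet.

    Hypothetical helper, not part of biomartr.
    """
    target = Path(target)
    if target.exists():
        return target  # already cached, nothing to download
    target.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary file in the same directory, then rename,
    # so other jobs never read a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".part")
    os.close(fd)
    try:
        download(url, Path(tmp_name))
        os.replace(tmp_name, target)
    finally:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)  # clean up after a failed download
    return target
```

For example, `fetch_once("ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt", "cache/bacteria/assembly_summary.txt")` would hit the network on the first call only.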

Thank you!

May 15 '20 18:05 taylorreiter

Hi Taylor,

Many thanks for contacting me and I am very happy to hear that you find biomartr useful.

Unfortunately, in its current form, you cannot pass the assembly_summary.txt file to the function call.

But I am happy to implement this feature so that you can parallelize the download process. Your snakemake pipeline sounds super useful and I am sure that many people will very much appreciate your efforts!

I will keep you posted regarding the new feature.

I hope this helps,

Best wishes, Hajk

May 15 '20 20:05 HajkD

Thank you! In case others find it useful, here is the Snakefile I would use to download the accessions I'm interested in:

Snakefile:

import pandas as pd
import re

gtdb_url = "https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/bac120_taxonomy_r89.tsv"
# read without a header row so the first accession is not consumed as column names
gtdb = pd.read_csv(gtdb_url, sep = "\t", header = None)
GENOMES = gtdb[gtdb.columns[0]].unique().tolist()
GENOMES = [genome for genome in GENOMES if 'RS_' in genome] # filter to refseq
GENOMES = [re.sub("RS_", "", genome) for genome in GENOMES] # strip the GTDB prefix

# default target: request one output file per accession
rule all:
    input: expand("inputs/refseq_gtdb_isolates/{genome}_genomic_refseq.fna.gz", genome = GENOMES)

rule download_genomes:
    output: "inputs/refseq_gtdb_isolates/{genome}_genomic_refseq.fna.gz"
    params: genome = lambda wildcards: wildcards.genome
    #conda: "envs/biomartr.yml"
    script: "scripts/download_gtdb_genomes_biomartr.R"

And the accompanying R file, which in this pipeline would need to be named scripts/download_gtdb_genomes_biomartr.R:

library(biomartr)
getGenome(
  db       = "refseq",
  organism = snakemake@params[["genome"]],
  path     = file.path("inputs", "refseq_gtdb_isolates"))

If biomartr is installed, this Snakefile can be executed in a conda environment defined by env.yml:

channels:
    - conda-forge
    - bioconda
    - defaults
dependencies:
    - python>=3.6
    - snakemake-minimal=5.8.1
    - pandas=0.25.3

To run the whole thing with these three files:

conda env create -f env.yml -n env
conda activate env
snakemake -s Snakefile

I explored using conda to install biomartr, but there is no conda package for it yet so it needs to be installed by the user.

May 15 '20 21:05 taylorreiter

This is absolutely brilliant! Thank you so much for sharing this with the community!

If it helps, I planned on writing a bioconda recipe for biomartr at some point anyway, so I could develop one alongside the new assembly_summary.txt functionality.

May 16 '20 09:05 HajkD

Maybe a stupid question, but what information in assembly_summary.txt is actually required?

@taylorreiter If this is still an active project of yours: why do you open a new R session (and a new snakemake task) for each individual download? I understand that this might make it easier to track failed files, but that could also be done within your R script with clear error messages. You would then not need to start a new snakemake task and a new R session, and load the library, more than 100k times.
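
A rough sketch of the batching idea, written in Python since the Snakefile is Python. The names `make_batches`, `download_batch`, and `download_one` are hypothetical and not part of biomartr or snakemake; `download_one` stands in for whatever actually fetches a single genome. Failures are collected per accession, so one bad accession does not hide the rest.

```python
def make_batches(accessions, batch_size):
    """Split accessions into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(accessions[i:i + batch_size])
            for i in range(0, len(accessions), batch_size)]


def download_batch(accessions, download_one):
    """Try every accession in the batch; return {accession: error message}."""
    failures = {}
    for accession in accessions:
        try:
            download_one(accession)
        except Exception as err:
            # record the failure and keep going with the remaining accessions
            failures[accession] = f"{type(err).__name__}: {err}"
    return failures
```

With a batch size of 1000, the ~100k accessions would become roughly 100 tasks instead of 100k, and each task would only need to start one session.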

Mar 03 '21 10:03 johanneswerner

The package now supports setting a cache directory for back-end files like this; see the function:

?cachedir_set

This issue can now be closed

Sep 27 '23 10:09 Roleren

Dear @Roleren

Thank you so much for adding this elegant solution!

Dear @taylorreiter

I hope this works for you and helps to optimise your workflow.

With many thanks, Hajk

Sep 27 '23 10:09 HajkD

Yes this is awesome, thank you for the update!

Sep 27 '23 11:09 taylorreiter
