GISAIDR
Automatically split downloads in chunks for queries with >4000 records
Just a small possible enhancement: would it be possible to have the download() function automatically split queries into chunks when the length of list_of_accession_ids is >4000?
Currently I do this myself, e.g. to fetch the most recently uploaded records:
df <- query(
  credentials = credentials,
  from_subm = as.character(GISAID_max_submdate),
  to_subm = as.character(today),
  fast = TRUE
)
dim(df) # 103356 1
# function to split a vector into chunks of at most chunk_length elements
chunk <- function(x, chunk_length = 4000) split(x, ceiling(seq_along(x) / chunk_length))
chunks <- chunk(df$accession_id)
downloads <- do.call(rbind, lapply(seq_along(chunks),
  function(i) {
    message(paste0("Downloading batch ", i, " out of ", length(chunks)))
    Sys.sleep(3)
    download(credentials = credentials,
             list_of_accession_ids = chunks[[i]])
  }))
dim(downloads) # 103356 29
names(downloads)
# [1] "strain" "virus" "accession_id"
# [4] "genbank_accession" "date" "region"
# [7] "country" "division" "location"
# [10] "region_exposure" "country_exposure" "division_exposure"
# [13] "segment" "length" "host"
# [16] "age" "sex" "Nextstrain_clade"
# [19] "pangolin_lineage" "GISAID_clade" "originating_lab"
# [22] "submitting_lab" "authors" "url"
# [25] "title" "paper_url" "date_submitted"
# [28] "purpose_of_sequencing" "sequence"
Even better would be to also parallelize this (if GISAID allows it), as the above is still relatively slow: it currently takes about 1.5 hours to download these 103K records from the last 5 days. When I tried a chunk size of 5000 I received a server error, so I reduced it to 4000 and that seemed to work...
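For what it's worth, a minimal sketch of what a parallel variant could look like using base R's parallel::mclapply, assuming GISAID tolerates a few concurrent sessions. The names download_parallel and n_workers are my own, not part of GISAIDR; the 4000-record chunk size is taken from the experiment above:

```r
library(parallel)

# split a vector into chunks of at most chunk_length elements
chunk <- function(x, chunk_length = 4000) {
  split(x, ceiling(seq_along(x) / chunk_length))
}

# hypothetical parallel variant: each worker downloads one chunk;
# mc.cores > 1 only helps if GISAID allows concurrent sessions
download_parallel <- function(credentials, accession_ids, n_workers = 2) {
  chunks <- chunk(accession_ids)
  results <- mclapply(chunks, function(ids) {
    download(credentials = credentials, list_of_accession_ids = ids)
  }, mc.cores = n_workers)
  do.call(rbind, results)
}
```

Note that on Windows mclapply silently runs sequentially (mc.cores is forced to 1), so a cluster-based approach via parLapply would be needed there.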
Thanks, I adapted this into the following:
chunk_size <- 1000
accessions <- query(credentials = credentials, location = "Africa / ...", fast = TRUE)
positions <- seq(1, nrow(accessions), by = chunk_size)
is_error <- function(err) inherits(err, "try-error")
chunks <- vector("list", length(positions))
# this loop can be re-run multiple times to resume interrupted downloads
for (index in seq_along(positions)) {
  position <- positions[index]
  if (is.null(chunks[[index]])) {
    message(paste("downloading", position, chunk_size))
    start <- position
    end <- min(position + chunk_size - 1, nrow(accessions)) # -1 avoids overlapping chunks
    chunk <- try(download(credentials = credentials,
                          accessions$accession_id[start:end],
                          get_sequence = FALSE))
    if (is_error(chunk)) {
      # refresh credentials and try one more time
      credentials <- login(username = username, password = password)
      chunk <- download(credentials = credentials,
                        accessions$accession_id[start:end],
                        get_sequence = FALSE)
    }
    chunk$position <- position
    chunks[[index]] <- chunk
    Sys.sleep(3)
  }
}
if (sum(sapply(chunks, is.null)) == 0) {
  # we have downloaded all the chunks
  message("download complete")
  african_entries <- do.call(rbind, chunks)
}
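The resumable loop above could also be folded into a reusable helper so callers never see the chunking. A rough sketch under the same assumptions: download_in_chunks is a hypothetical name, username and password come from the enclosing scope as in the snippet above, and each failed chunk is retried once after a re-login:

```r
# hypothetical wrapper around GISAIDR's download(): fetch accession IDs
# in chunks of chunk_size, retrying each failed chunk once after re-login
download_in_chunks <- function(credentials, accession_ids,
                               chunk_size = 1000, pause = 3, ...) {
  starts <- seq(1, length(accession_ids), by = chunk_size)
  pieces <- lapply(starts, function(start) {
    end <- min(start + chunk_size - 1, length(accession_ids))
    piece <- try(download(credentials = credentials,
                          accession_ids[start:end], ...))
    if (inherits(piece, "try-error")) {
      # refresh credentials (persisted for later chunks via <<-) and retry once
      credentials <<- login(username = username, password = password)
      piece <- download(credentials = credentials,
                        accession_ids[start:end], ...)
    }
    Sys.sleep(pause)
    piece
  })
  do.call(rbind, pieces)
}
```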