
Automatically split downloads in chunks for queries with >4000 records

tomwenseleers opened this issue on Jul 30, 2022 · 1 comment

Just a small possible enhancement: would it be possible to have the download function automatically split queries into chunks when the length of list_of_accession_ids is >4000?

Now I do this myself, e.g. to fetch the most recently uploaded records, using

# fast query returning only the accession IDs submitted between GISAID_max_submdate and today
df = query(
  credentials = credentials, 
  from_subm = as.character(GISAID_max_submdate), 
  to_subm = as.character(today),
  fast = TRUE
)
dim(df) # 103356      1
# function to split vector in chunks of max size chunk_length
chunk = function(x, chunk_length=4000) split(x, ceiling(seq_along(x)/chunk_length))

chunks = chunk(df$accession_id)
downloads = do.call(rbind, lapply(seq_along(chunks),
                   function (i) {
                     message(paste0("Downloading batch ", i, " out of ", length(chunks)))
                     Sys.sleep(3)  # small pause between batches
                     return(download(credentials = credentials, 
                              list_of_accession_ids = chunks[[i]])) } ))
dim(downloads) # 103356     29
names(downloads)
# [1] "strain"                "virus"                 "accession_id"         
# [4] "genbank_accession"     "date"                  "region"               
# [7] "country"               "division"              "location"             
# [10] "region_exposure"       "country_exposure"      "division_exposure"    
# [13] "segment"               "length"                "host"                 
# [16] "age"                   "sex"                   "Nextstrain_clade"     
# [19] "pangolin_lineage"      "GISAID_clade"          "originating_lab"      
# [22] "submitting_lab"        "authors"               "url"                  
# [25] "title"                 "paper_url"             "date_submitted"       
# [28] "purpose_of_sequencing" "sequence"  

Even better would be to also have this parallelized (if GISAID allows that), as the above is still relatively slow: it currently takes about 1.5 hours to download these 103K records from the last 5 days. When I tried a chunk size of 5000 I received a server error, so I reduced it to 4000 and that seemed to work...
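
An untested sketch of the parallel variant, assuming GISAID tolerates a couple of concurrent requests from the same session (it may well not), and reusing the chunks list from above:

library(parallel)
# mclapply forks worker processes; on Windows forking is unavailable, so use mc.cores = 1
downloads <- do.call(rbind, mclapply(seq_along(chunks), function(i) {
  Sys.sleep(3)  # small pause per worker to go easy on the server
  download(credentials = credentials, list_of_accession_ids = chunks[[i]])
}, mc.cores = 2))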

tomwenseleers · Jul 30, 2022

Thanks, I adapted this into the following:

chunk_size <- 1000

accessions <- query(credentials = credentials, location = "Africa / ...", fast = TRUE)
positions = seq(1, nrow(accessions), by=chunk_size)
is_error <- function(err) inherits(err,'try-error')

chunks = vector("list", length(positions))

# this can be run multiple times to continue downloads
for (index in seq_along(positions)) {
  position <- positions[index]
  if (is.null(chunks[[index]])) {
    message(paste("downloading ", position, chunk_size))
    start <- position
    end <- min(position + chunk_size, nrow(accessions))
    chunk <-
      try(download(credentials = credentials,
                   accessions$accession_id[start:end],
                   get_sequence = FALSE))
    if (is_error(chunk)) {
      # refresh credentials and try one more time
      credentials <- login(username = username, password = password)
      chunk <-
        download(credentials = credentials,
                 accessions$accession_id[start:end],
                 get_sequence = FALSE)
    }
    chunk$position <- position
    chunks[[index]] <- chunk
    Sys.sleep(3)
  }
}

if (sum(sapply(chunks, is.null)) == 0) {
  # we have downloaded all the chunks
  message("download complete")
  african_entries = do.call(rbind, chunks)
}
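
To make the "run multiple times to continue downloads" pattern also survive an R restart, the partially filled list could be persisted between runs (file name arbitrary):

saveRDS(chunks, "gisaid_chunks.rds")       # after (or inside) the loop
# chunks <- readRDS("gisaid_chunks.rds")   # in a fresh session, before re-running the loop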

pvanheus · Nov 17, 2022