Large NHANES files cause lodown() to hang [SOLVED]

Open dwinsemius opened this issue 3 years ago • 2 comments

Apologies for the noise below, but it might save someone some time or bandwidth, if that is an issue. I eventually figured out that setting the timeout for the libcurl download method would have been the proper way to handle this:

options(timeout = max(300, getOption("timeout")))

It might be useful to mention in the documentation for lodown that this option exists. I tried adding timeout=300 as a parameter to lodown, but it didn't seem to get passed on to the download.file function.

See: https://stat.ethz.ch/R-manual/R-devel/library/utils/html/download.file.html
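A minimal sketch of scoping that option to a single call, so the session default is restored afterwards. `with_timeout` is a hypothetical helper name, not part of lodown or base R; the commented-out `lodown()` call simply reuses the invocation from the transcript further down.

```r
# Hypothetical helper: temporarily raise R's download timeout,
# evaluate `expr`, then restore the previous setting.
with_timeout <- function(seconds, expr) {
  # options() returns the old values, so they can be restored on exit
  old <- options(timeout = max(seconds, getOption("timeout")))
  on.exit(options(old), add = TRUE)
  force(expr)  # `expr` is evaluated lazily, i.e. only after the timeout is raised
}

# Usage (assumes lodown is installed; not run here):
# with_timeout(300, lodown("nhanes", output_dir = file.path(path.expand("~"), "NHANES")))
```

Because the old value is restored by `on.exit()`, other code in the same session keeps whatever timeout it had before.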

###---- what I initially did before figuring this out

I could get around this problem since none of these files contains anything of interest to me, but this issue report might be useful to someone who has either less debugging/sidestepping capability or who is interested in whatever those files contain. The "breaking point" seemed to be files larger than about 50 MB.

Problem file examples (mostly dietary history):
https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT
https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DSII.XPT
https://wwwn.cdc.gov/Nchs/Nhanes/2003-2004/DR1IFF_C.XPT # pattern "DR.IFF"

My method was to identify the names of problem files and then exclude any with a similar name from a "reduced catalog":

help("get_catalog")
library(lodown)
NH_catalog <- get_catalog( "nhanes" )
building catalog for nhanes

Warning message:
In xtfrm.data.frame(x) : cannot xtfrm data frames

str(NH_catalog)
'data.frame': 1386 obs. of 8 variables:
 $ years          : chr "1988-2020" "1999-2000" "1999-2000" "1999-2000" ...
 $ data_name      : chr "Prescription Medications - Drug Information" "Acculturation" "Albumin & Creatinine - Urine" "Alcohol Use" ...
 $ doc_name       : chr "RXQ_DRUG Doc" "ACQ Doc" "LAB16 Doc" "ALQ Doc" ...
 $ file_name      : chr "RXQ_DRUG Data [XPT - 2.6 KB]" "ACQ Data [XPT - 629.2 KB]" "LAB16 Data [XPT - 306.9 KB]" "ALQ Data [XPT - 314.5 KB]" ...
 $ date_published : chr "Updated September 2021" "June 2002" "June 2002" "June 2002" ...
 $ full_url       : chr "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/RXQ_DRUG.xpt" "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/ACQ.XPT" "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB16.XPT" "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/ALQ.XPT" ...
 $ doc_url        : chr "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/RXQ_DRUG.htm" "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/ACQ.htm" "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB16.htm" "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/ALQ.htm" ...
 $ output_filename: chr "/home/david/1988-2020/rxq_drug.rds" "/home/david/1999-2000/acq.rds" "/home/david/1999-2000/lab16.rds" "/home/david/1999-2000/alq.rds" ...
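The exclusion step itself could look roughly like the sketch below. It uses a small toy data frame standing in for the real catalog (only the `full_url` column is reproduced), and the regular expression `"DR.IFF|DSII"` is an assumption built from the three problem URLs listed above, not a pattern taken from the original session.

```r
# Toy stand-in for the catalog returned by get_catalog("nhanes");
# only the full_url column matters for this filter.
catalog <- data.frame(
  full_url = c(
    "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/ACQ.XPT",
    "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT",
    "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DSII.XPT",
    "https://wwwn.cdc.gov/Nchs/Nhanes/2003-2004/DR1IFF_C.XPT",
    "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DEMO.XPT"
  ),
  stringsAsFactors = FALSE
)

# Assumed pattern covering the problem files named above
problem_pattern <- "DR.IFF|DSII"

# Flag rows whose file name matches the pattern, then keep the rest
is_problem <- grepl(problem_pattern, basename(catalog$full_url))
reduced_catalog <- catalog[!is_problem, , drop = FALSE]

print(basename(reduced_catalog$full_url))
```

With the real catalog, the reduced data frame would then be handed back to lodown, for example as `lodown("nhanes", reduced_catalog)`. That the second argument accepts a catalog is my reading of the package's usage and is worth checking against `help("lodown")`.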

Transcript of the console session for the first problem file and the traceback results and sessionInfo() output

R version 4.1.2 (2021-11-01) -- "Bird Hippie" Copyright (C) 2021 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit)

library(lodown)
lodown( "nhanes" , output_dir = file.path( path.expand( "~" ) , "NHANES" ) )
building catalog for nhanes

locally downloading nhanes

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/RXQ_DRUG.xpt' cached in '/tmp/7b19911d55e805e1d0805ff339a17653.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 1 of 1386 stored at '/home/david/NHANES/1988-2020/rxq_drug.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/ACQ.XPT' cached in '/tmp/703b77f37fc02825377c4bfeec43cf6a.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 2 of 1386 stored at '/home/david/NHANES/1999-2000/acq.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB16.XPT' cached in '/tmp/3409c25fb06bd690bd6ac0c1eb81a399.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 3 of 1386 stored at '/home/david/NHANES/1999-2000/lab16.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/ALQ.XPT' cached in '/tmp/c4526e05ffc794c15b797c11e006de19.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 4 of 1386 stored at '/home/david/NHANES/1999-2000/alq.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/RXQ_ANA.XPT' cached in '/tmp/e0b28848499fa2fc04ebe5959478e181.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 5 of 1386 stored at '/home/david/NHANES/1999-2000/rxq_ana.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/AUX1.XPT' cached in '/tmp/7eb2dcca746abe485bbb52fccef0bed2.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 6 of 1386 stored at '/home/david/NHANES/1999-2000/aux1.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/AUQ.XPT' cached in '/tmp/3ac023bc137a36066f123d398bd11b80.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 7 of 1386 stored at '/home/david/NHANES/1999-2000/auq.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/AUXAR.XPT' cached in '/tmp/5fbb8b9e5bb0c8fcd7930827ceb8a9b3.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 8 of 1386 stored at '/home/david/NHANES/1999-2000/auxar.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/AUXTYM.XPT' cached in '/tmp/c457f6bae389f97d3e027b09e9cd9234.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 9 of 1386 stored at '/home/david/NHANES/1999-2000/auxtym.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/BAX.XPT' cached in '/tmp/2377265dc49782af5ad9888e2d7cbe2c.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 10 of 1386 stored at '/home/david/NHANES/1999-2000/bax.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/BAQ.XPT' cached in '/tmp/4b4e0e1a3fb9294ce9d7500fdf89b16f.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 11 of 1386 stored at '/home/david/NHANES/1999-2000/baq.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/BIX.XPT' cached in '/tmp/f44b41d1b39c8e6ee786f1283c9a0486.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 12 of 1386 stored at '/home/david/NHANES/1999-2000/bix.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/BPX.XPT' cached in '/tmp/3a732006357a37a351e5b9ee67063394.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 13 of 1386 stored at '/home/david/NHANES/1999-2000/bpx.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/BPQ.XPT' cached in '/tmp/0fbc99763e9dd4d4e6b1acc0dc793fec.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 14 of 1386 stored at '/home/david/NHANES/1999-2000/bpq.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/BMX.XPT' cached in '/tmp/a50587b7bde42d582e73705c076cdb2c.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 15 of 1386 stored at '/home/david/NHANES/1999-2000/bmx.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB06.XPT' cached in '/tmp/47e7fede22e132bdace16df577669950.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 16 of 1386 stored at '/home/david/NHANES/1999-2000/lab06.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/CVX.XPT' cached in '/tmp/261acc9999609268718cb6439168a2ba.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 17 of 1386 stored at '/home/david/NHANES/1999-2000/cvx.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/CDQ.XPT' cached in '/tmp/69432e2b673f861fecaf82b15661be8b.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 18 of 1386 stored at '/home/david/NHANES/1999-2000/cdq.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB05.XPT' cached in '/tmp/7d25856bc319dacccdc2aca4f6e76cf2.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 19 of 1386 stored at '/home/david/NHANES/1999-2000/lab05.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB13AM.XPT' cached in '/tmp/af34b4ac36198259a29e7038e49abc54.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 20 of 1386 stored at '/home/david/NHANES/1999-2000/lab13am.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB13.XPT' cached in '/tmp/29ae9c85928a8a7f2d31e5dc2c3c4bae.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 21 of 1386 stored at '/home/david/NHANES/1999-2000/lab13.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/CFQ.XPT' cached in '/tmp/93c1d40e51bf96fd4e5350148a972247.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 22 of 1386 stored at '/home/david/NHANES/1999-2000/cfq.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB25.XPT' cached in '/tmp/2e6fd2a654a4b81381f61ff53e6ce3c3.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 23 of 1386 stored at '/home/david/NHANES/1999-2000/lab25.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB11.XPT' cached in '/tmp/0e46738dad51551884e7e1490d127da9.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 24 of 1386 stored at '/home/david/NHANES/1999-2000/lab11.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/LAB17.XPT' cached in '/tmp/20d755e748b9fb2cdcf7eb7b5b8818ba.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 25 of 1386 stored at '/home/david/NHANES/1999-2000/lab17.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/HSQ.XPT' cached in '/tmp/bb918ddfbf9cce80511df6e533605c2b.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 26 of 1386 stored at '/home/david/NHANES/1999-2000/hsq.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DEMO.XPT' cached in '/tmp/6ff21da382befb12fdde3332474aff2c.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 27 of 1386 stored at '/home/david/NHANES/1999-2000/demo.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DEQ.XPT' cached in '/tmp/d4af86bdb83265fde449d980eede53a2.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 28 of 1386 stored at '/home/david/NHANES/1999-2000/deq.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DIQ.XPT' cached in '/tmp/d41d69184a500f8a8935b92a6c7d7c5d.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 29 of 1386 stored at '/home/david/NHANES/1999-2000/diq.rds'

'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DBQ.XPT' cached in '/tmp/0cf86e9ecc8e5e0731e3ed0de0bbef0d.Rcache' copying to '/tmp/RtmplQqmna/file72047806a096'

nhanes catalog entry 30 of 1386 stored at '/home/david/NHANES/1999-2000/dbq.rds'

downloading from URL 'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT' to file '/tmp/RtmplQqmna/file72047806a096'

trying URL 'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT'
Content type 'application/octet-stream' length 71600960 bytes (68.3 MB)

downloaded 52.3 MB

download issue with 'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT'

Warning messages:
1: In xtfrm.data.frame(x) : cannot xtfrm data frames
2: In (function (url, destfile, method, quiet = FALSE, mode = "w", :
  downloaded length 54835393 != reported length 71600960
3: In (function (url, destfile, method, quiet = FALSE, mode = "w", :
  URL 'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT': Timeout of 60 seconds was reached

  years
1 1988-2020
2 1999-2000

Snipped a lot of output

109 /home/david/NHANES/2001-2002/auq_b.rds NA
110 /home/david/NHANES/2001-2002/aux_b.rds NA
111 /home/david/NHANES/2001-2002/auxar_b.rds NA
[ reached 'max' / getOption("max.print") -- omitted 1275 rows ]

traceback()
5: Sys.sleep(sleepsec)
4: cachaca(catalog[i, "full_url"], tf, mode = "wb")
3: load_fun(data_name = data_name, catalog, ...)
2: withCallingHandlers(catalog <- load_fun(data_name = data_name, catalog, ...), error = function(e) {
       print(sessionInfo())
       if (grepl("cannot allocate vector of size", e))
           message(memory_note)
       else if (grepl("parameter must be specified", e))
           message(parameter_note)
       else if (grepl("to install", e))
           message(installation_note)
       else {
           message(unknown_error_note)
           print(sys.calls())
       }
   })
1: lodown("nhanes", output_dir = file.path(path.expand("~"), "NHANES"))

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] lodown_0.1.0

loaded via a namespace (and not attached):
[1] fansi_1.0.2 digest_0.6.29 utf8_1.2.2 crayon_1.4.2 R6_2.5.1 lifecycle_1.0.1 magrittr_2.0.2
[8] pillar_1.7.0 httr_1.4.2 stringi_1.7.6 rlang_1.0.1 cli_3.1.1 curl_4.3.2 xml2_1.3.3
[15] vctrs_0.3.8 ellipsis_0.3.2 tools_4.1.2 foreign_0.8-81 stringr_1.4.0 selectr_0.4-2 glue_1.6.1
[22] compiler_4.1.2 pkgconfig_2.0.3 rvest_1.0.2 tibble_3.1.6
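The numbers in the warning messages of the transcript above are enough for a rough estimate of how long the full download would have needed on this connection. The sketch below is plain arithmetic on those reported values and assumes the transfer rate stays constant, which real connections rarely do.

```r
reported_bytes   <- 71600960  # "reported length" from the warning
downloaded_bytes <- 54835393  # "downloaded length" from the warning
timeout_sec      <- 60        # "Timeout of 60 seconds was reached"

# Average throughput observed before the timeout cut the transfer off
bytes_per_sec <- downloaded_bytes / timeout_sec

# Time the full file would need at that same (assumed constant) rate
needed_sec <- reported_bytes / bytes_per_sec

cat(sprintf("throughput: %.2f MB/s\n", bytes_per_sec / 2^20))
cat(sprintf("estimated time for full file: %.1f s\n", needed_sec))
```

The estimate comes out somewhat above the 60-second limit, which is consistent with the download being cut off part-way and with a 300-second timeout leaving a comfortable margin for this file.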

dwinsemius avatar Feb 08 '22 00:02 dwinsemius

hi! are you able to isolate the reason behind that timeout? i get

> system.time( download.file( "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT" , tempfile() , mode = 'wb' ) )
trying URL 'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT'
Content type 'application/octet-stream' length 71600960 bytes (68.3 MB)
downloaded 68.3 MB

   user  system elapsed 
   1.10    1.64   66.29 

we could surely edit https://github.com/ajdamico/lodown/blob/master/R/cachaca.R but i'm a bit confused why your libcurl defaults to only one minute? :-/

thanks! hope you are excellent
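For anyone who wants to work around this outside the package, here is a hedged sketch of a download wrapper that raises the timeout and retries a few times. It is not the code in cachaca.R; `download_with_retry` and its arguments are hypothetical names, and `download_fun` exists only so that a stand-in can be substituted for `utils::download.file` when testing.

```r
# Hypothetical wrapper: raise the timeout, retry on failure, restore options.
download_with_retry <- function(url, destfile, attempts = 3, timeout = 300,
                                sleep_sec = 1,
                                download_fun = utils::download.file) {
  # Raise the timeout for the duration of this call only
  old <- options(timeout = max(timeout, getOption("timeout")))
  on.exit(options(old), add = TRUE)

  for (i in seq_len(attempts)) {
    # Treat both errors and warnings (e.g. a truncated download) as failure
    status <- tryCatch(
      download_fun(url, destfile, mode = "wb", quiet = TRUE),
      error   = function(e) 1L,
      warning = function(w) 1L
    )
    # download.file() returns 0 on success
    if (identical(as.integer(status), 0L)) return(invisible(TRUE))
    # Wait a little longer before each further attempt
    if (i < attempts) Sys.sleep(sleep_sec * i)
  }
  stop("download failed after ", attempts, " attempts: ", url)
}

# Usage (needs network access; not run here):
# download_with_retry("https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT", tempfile())
```

Treating warnings as failures is a deliberate choice here: in the transcript above the truncated download surfaced as a warning, not as an error.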

ajdamico avatar Feb 08 '22 06:02 ajdamico

Here's my Mac experiment:

-- Best Regards; David.

options('timeout')
$timeout
[1] 60

system.time( download.file( "https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT" , tempfile() , mode = 'wb' ) )
trying URL 'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT'
Content type 'application/octet-stream' length 71600960 bytes (68.3 MB)
=====================================
downloaded 51.3 MB

Error in download.file("https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT", :
  download from 'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT' failed
In addition: Warning messages:
1: In download.file("https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT", :
  downloaded length 53763602 != reported length 71600960
2: In download.file("https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT", :
  URL 'https://wwwn.cdc.gov/Nchs/Nhanes/1999-2000/DRXIFF.XPT': Timeout of 60 seconds was reached
Timing stopped at: 1.455 1.659 60.01

Hope this is useful. (It's not the most recent version of RCurl since I am having some problems with gfortran versions. I've got version 10 and the source code seems to expect version 7.)

#-------------> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] devtools_2.4.3 usethis_2.1.5

loaded via a namespace (and not attached):
[1] magrittr_2.0.2 pkgload_1.2.4 R6_2.5.1 rlang_1.0.1 fastmap_1.1.0
[6] tools_4.1.1 pkgbuild_1.3.0 sessioninfo_1.2.2 cli_3.1.1 withr_2.4.3
[11] ellipsis_0.3.2 remotes_2.4.2 rprojroot_2.0.2 lifecycle_1.0.1 crayon_1.4.2
[16] processx_3.5.2 purrr_0.3.4 callr_3.7.0 fs_1.5.2 ps_1.6.0
[21] curl_4.3.2 testthat_3.1.1 memoise_2.0.1 glue_1.6.1 cachem_1.0.6
[26] compiler_4.1.1 desc_1.4.0 prettyunits_1.1.1

dwinsemius avatar Feb 09 '22 01:02 dwinsemius

hi! apologies for the long delay. i've made a couple of big updates to asdfree.com that hopefully make the website a bit better, but i've decided to stop maintaining the lodown package so probably won't fix the bug you've reported. the new asdfree does have nhanes data, but only for the most current year. thanks and hope you are great

ajdamico avatar Jan 09 '24 02:01 ajdamico