vroom doesn't respect skip/comment options for remote (HTTP) gzipped CSVs
I've recently hit a strange edge case where `vroom` doesn't respect the `skip` (nor `comment`) options when reading a gzipped CSV over a remote URL connection. The reprex below shows an empty tibble being created. Downloading the file first and reading the local copy works as expected, as does reading a non-gzipped remote file.
I've seen some issues on `readr` that suggest a limitation with the base R connection readers that may account for this, but I can't find any documented limitation that says this shouldn't work. I'm hoping this is something that can be fixed here (and then picked up in `readr`).
Please let me know if there is any additional info I can give for debugging. Thanks for the great package!
library(vroom)
url <- "https://epss.cyentia.com/epss_scores-2022-02-04.csv.gz"
# this fails
vroom(url, comment = "#")
#> New names:
#> * `` -> ...1
#> Rows: 0 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#>
#> chr (1): ...1
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 0 × 1
#> # … with 1 variable: ...1 <chr>
# this works
tmp <- tempfile()
download.file(url, tmp)
vroom(tmp, comment = "#")
#> Rows: 168325 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): cve
#> dbl (2): epss, percentile
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 168,325 × 3
#> cve epss percentile
#> <chr> <dbl> <dbl>
#> 1 CVE-2021-42013 0.816 0.995
#> 2 CVE-2021-1732 0.0910 0.869
#> 3 CVE-2021-4034 0.0364 0.666
#> 4 CVE-2013-1763 0.0382 0.673
#> 5 CVE-2014-3153 0.0230 0.583
#> 6 CVE-2019-17497 0.0525 0.759
#> 7 CVE-2018-4993 0.722 0.991
#> 8 CVE-2021-44228 0.944 0.999
#> 9 CVE-2019-5420 0.931 0.999
#> 10 CVE-2021-3156 0.587 0.986
#> # … with 168,315 more rows
unlink(tmp)
Created on 2022-02-04 by the reprex package (v2.0.1)
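In the meantime, the download-first path from the reprex above can be wrapped in a small helper. This is only a sketch of a workaround, not a fix: `read_remote_gz()` is a hypothetical name (it is not part of vroom), and it assumes the remote file is small enough to download in full before reading.

```r
library(vroom)

# Hypothetical helper (not part of vroom): download a remote gzipped file
# to a temporary path, then let vroom read the local copy.
read_remote_gz <- function(url, ...) {
  # Keep a ".gz" extension so the compression is obvious from the file name
  tmp <- tempfile(fileext = ".csv.gz")
  # mode = "wb" avoids corrupting the binary gzip stream on Windows
  download.file(url, tmp, mode = "wb", quiet = TRUE)
  # The temporary file is left in place on purpose, in case vroom reads it
  # lazily; R removes its temp directory when the session ends.
  vroom(tmp, ...)
}

# Usage with the file from the reprex:
# read_remote_gz("https://epss.cyentia.com/epss_scores-2022-02-04.csv.gz", comment = "#")
```

Any extra arguments (`comment`, `skip`, `col_types`, and so on) are passed straight through to `vroom()`, so the call reads the same as the failing one apart from the function name.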
Session info
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] vroom_1.5.7
#>
#> loaded via a namespace (and not attached):
#> [1] pillar_1.6.2 compiler_4.1.2 highr_0.9 R.methodsS3_1.8.1
#> [5] R.utils_2.11.0 tools_4.1.2 digest_0.6.27 bit_4.0.4
#> [9] evaluate_0.14 lifecycle_1.0.0 tibble_3.1.6 R.cache_0.15.0
#> [13] pkgconfig_2.0.3 rlang_1.0.0 reprex_2.0.1 cli_3.0.1
#> [17] rstudioapi_0.13 curl_4.3.2 parallel_4.1.2 yaml_2.2.1
#> [21] xfun_0.29 fastmap_1.1.0 withr_2.4.2 styler_1.6.2
#> [25] stringr_1.4.0 knitr_1.37 fs_1.5.0 vctrs_0.3.8
#> [29] bit64_4.0.5 tidyselect_1.1.1 glue_1.6.1 fansi_0.5.0
#> [33] rmarkdown_2.10 tzdb_0.2.0 purrr_0.3.4 magrittr_2.0.1
#> [37] backports_1.2.1 ellipsis_0.3.2 htmltools_0.5.2 utf8_1.2.2
#> [41] stringi_1.7.4 crayon_1.4.1 R.oo_1.24.0