
vroom doesn't respect skip/comment options for remote (HTTP) gzipped CSVs

Open davidski opened this issue 2 years ago • 0 comments

I've recently hit a strange edge case where vroom doesn't respect the skip (or comment) option when reading a gzipped CSV over a remote URL connection. The reprex below shows an empty tibble being created. Downloading the file first allows the read to work as expected, as does reading a non-gzipped remote file.

I've seen some readr issues suggesting that a limitation in the base R connection readers may account for this, but I can't find any documented limitation that says this shouldn't work. I'm hoping a fix can be implemented here (and then picked up by readr).

Please let me know if there is any additional info I can give for debugging. Thanks for the great package!

library(vroom)
url <- "https://epss.cyentia.com/epss_scores-2022-02-04.csv.gz"

# this fails
vroom(url, comment = "#")
#> New names:
#> * `` -> ...1
#> Rows: 0 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> 
#> chr (1): ...1
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 0 × 1
#> # … with 1 variable: ...1 <chr>

# this works
tmp <- tempfile()
download.file(url, tmp)
vroom(tmp, comment = "#")
#> Rows: 168325 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): cve
#> dbl (2): epss, percentile
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 168,325 × 3
#>    cve              epss percentile
#>    <chr>           <dbl>      <dbl>
#>  1 CVE-2021-42013 0.816       0.995
#>  2 CVE-2021-1732  0.0910      0.869
#>  3 CVE-2021-4034  0.0364      0.666
#>  4 CVE-2013-1763  0.0382      0.673
#>  5 CVE-2014-3153  0.0230      0.583
#>  6 CVE-2019-17497 0.0525      0.759
#>  7 CVE-2018-4993  0.722       0.991
#>  8 CVE-2021-44228 0.944       0.999
#>  9 CVE-2019-5420  0.931       0.999
#> 10 CVE-2021-3156  0.587       0.986
#> # … with 168,315 more rows
unlink(tmp)

Created on 2022-02-04 by the reprex package (v2.0.1)
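
Until this is resolved, the download-first workaround from the reprex can be wrapped in a small helper. This is only a sketch: read_remote_gz is a hypothetical name, not part of vroom, and it assumes that materializing the data with altrep = FALSE before the temporary file is removed is acceptable for the file size involved.

```r
library(vroom)

# Hypothetical helper (not part of vroom): download a remote file to a
# temporary path, read it with vroom(), then remove the temporary copy.
read_remote_gz <- function(url, ...) {
  tmp <- tempfile(fileext = ".csv.gz")
  on.exit(unlink(tmp), add = TRUE)

  # mode = "wb" keeps the gzip bytes intact on Windows.
  download.file(url, tmp, mode = "wb", quiet = TRUE)

  # altrep = FALSE reads every column eagerly, so nothing still depends on
  # the temporary file once it is deleted on exit.
  vroom(tmp, ..., altrep = FALSE)
}
```

Calling read_remote_gz(url, comment = "#") with the url from the reprex should then follow the same path as the "this works" case above. Because the helper sets altrep itself, that argument cannot also be passed through the dots.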

Session info
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] vroom_1.5.7
#> 
#> loaded via a namespace (and not attached):
#>  [1] pillar_1.6.2      compiler_4.1.2    highr_0.9         R.methodsS3_1.8.1
#>  [5] R.utils_2.11.0    tools_4.1.2       digest_0.6.27     bit_4.0.4        
#>  [9] evaluate_0.14     lifecycle_1.0.0   tibble_3.1.6      R.cache_0.15.0   
#> [13] pkgconfig_2.0.3   rlang_1.0.0       reprex_2.0.1      cli_3.0.1        
#> [17] rstudioapi_0.13   curl_4.3.2        parallel_4.1.2    yaml_2.2.1       
#> [21] xfun_0.29         fastmap_1.1.0     withr_2.4.2       styler_1.6.2     
#> [25] stringr_1.4.0     knitr_1.37        fs_1.5.0          vctrs_0.3.8      
#> [29] bit64_4.0.5       tidyselect_1.1.1  glue_1.6.1        fansi_0.5.0      
#> [33] rmarkdown_2.10    tzdb_0.2.0        purrr_0.3.4       magrittr_2.0.1   
#> [37] backports_1.2.1   ellipsis_0.3.2    htmltools_0.5.2   utf8_1.2.2       
#> [41] stringi_1.7.4     crayon_1.4.1      R.oo_1.24.0

davidski, Feb 04 '22 19:02