incomplete download from URL

Open r2evans opened this issue 8 months ago • 1 comments

Motivated by https://stackoverflow.com/questions/79587028/read-csv-only-reads-a-fraction-of-rows-from-a-zipped-file-when-reading-from-url

A gzipped file is only partially read when vroom is given its URL, whereas reading the pre-downloaded file returns every row. Notice 429498 rows when loaded as a local file, 1798 when loaded by URL.

download.file("https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_station/USW00014839.csv.gz", "USW00014839.csv.gz")
# trying URL 'https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_station/USW00014839.csv.gz'
# Content type 'application/gzip' length 1699132 bytes (1.6 MB)
# ==================================================
# downloaded 1.6 MB

vroom::vroom("USW00014839.csv.gz", col_names = FALSE)
# Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
#   dat <- vroom(...)
#   problems(dat)
# Rows: 429498 Columns: 8
# ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
# Delimiter: ","
# chr (4): X1, X3, X5, X7
# dbl (3): X2, X4, X8
# lgl (1): X6
# ℹ Use `spec()` to retrieve the full column specification for this data.
# ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 429,498 × 8
#    X1                X2 X3       X4 X5    X6    X7       X8
#    <chr>          <dbl> <chr> <dbl> <chr> <lgl> <chr> <dbl>
#  1 USW00014839 19380401 TMAX     22 <NA>  NA    X        NA
#  2 USW00014839 19380402 TMAX      6 <NA>  NA    X        NA
#  3 USW00014839 19380403 TMAX     33 <NA>  NA    X        NA
#  4 USW00014839 19380404 TMAX     33 <NA>  NA    X        NA
#  5 USW00014839 19380405 TMAX      0 <NA>  NA    X        NA
#  6 USW00014839 19380406 TMAX      0 <NA>  NA    X        NA
#  7 USW00014839 19380407 TMAX     17 <NA>  NA    X        NA
#  8 USW00014839 19380408 TMAX     22 <NA>  NA    X        NA
#  9 USW00014839 19380409 TMAX     56 <NA>  NA    X        NA
# 10 USW00014839 19380410 TMAX    117 <NA>  NA    X        NA
# # ℹ 429,488 more rows
# # ℹ Use `print(n = ...)` to see more rows

vroom::vroom("https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_station/USW00014839.csv.gz", col_names = FALSE)
# Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
#   dat <- vroom(...)
#   problems(dat)
# Rows: 1798 Columns: 8
# ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
# Delimiter: ","
# chr (3): X1, X3, X7
# dbl (2): X2, X4
# lgl (3): X5, X6, X8
# ℹ Use `spec()` to retrieve the full column specification for this data.
# ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 1,798 × 8
#    X1                X2 X3       X4 X5    X6    X7    X8   
#    <chr>          <dbl> <chr> <dbl> <lgl> <lgl> <chr> <lgl>
#  1 USW00014839 19380401 TMAX     22 NA    NA    X     NA   
#  2 USW00014839 19380402 TMAX      6 NA    NA    X     NA   
#  3 USW00014839 19380403 TMAX     33 NA    NA    X     NA   
#  4 USW00014839 19380404 TMAX     33 NA    NA    X     NA   
#  5 USW00014839 19380405 TMAX      0 NA    NA    X     NA   
#  6 USW00014839 19380406 TMAX      0 NA    NA    X     NA   
#  7 USW00014839 19380407 TMAX     17 NA    NA    X     NA   
#  8 USW00014839 19380408 TMAX     22 NA    NA    X     NA   
#  9 USW00014839 19380409 TMAX     56 NA    NA    X     NA   
# 10 USW00014839 19380410 TMAX    117 NA    NA    X     NA   
# # ℹ 1,788 more rows
# # ℹ Use `print(n = ...)` to see more rows

The problems() suggestion does not help much; it mostly reports a few parsing errors caused by an incorrect guess of lgl for column 6:

# # A tibble: 3 × 5
#     row   col expected           actual file 
#   <int> <int> <chr>              <chr>  <chr>
# 1    15     6 1/0/T/F/TRUE/FALSE I      ""   
# 2    44     6 1/0/T/F/TRUE/FALSE I      ""   
# 3   858     6 1/0/T/F/TRUE/FALSE S      ""

I tested at least one other gzipped file on that site (USW00014838.csv.gz), and it shows the same behavior, albeit at a different cut point (6628 of 204830 rows). It does not appear to be a plain networking issue: both download.file (shown above) and data.table::fread (not shown) retrieve the entire file without any message, warning, or error, and reading by URL produces the same truncated number of rows each time (seen by multiple users, likely on multiple OSes, though that is not verified).

session_info()

sessioninfo::session_info(pkgs = "vroom")
# ─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────
#  setting  value
#  version  R version 4.4.3 (2025-02-28)
#  os       macOS Sequoia 15.4.1
#  system   aarch64, darwin20
#  ui       X11
#  language (EN)
#  collate  en_US.UTF-8
#  ctype    en_US.UTF-8
#  tz       America/New_York
#  date     2025-04-22
#  pandoc   3.6.4 @ /opt/homebrew/opt/pandoc/bin/ (via rmarkdown)
#  quarto   1.6.43 @ /usr/local/bin/quarto
# ─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────
#  package     * version date (UTC) lib source
#  bit           4.5.0.1 2024-12-03 [1] RSPM (R 4.4.0)
#  bit64         4.6.0-1 2025-01-16 [1] RSPM (R 4.4.0)
#  cli           3.6.4   2025-02-13 [1] RSPM (R 4.4.3)
#  cpp11         0.5.1   2024-12-04 [1] RSPM (R 4.4.0)
#  crayon        1.5.3   2024-06-20 [1] RSPM (R 4.4.3)
#  fansi         1.0.6   2023-12-08 [1] RSPM (R 4.4.3)
#  glue          1.8.0   2024-09-30 [1] RSPM (R 4.4.3)
#  hms           1.1.3   2023-03-21 [1] RSPM (R 4.4.0)
#  lifecycle     1.0.4   2023-11-07 [1] RSPM (R 4.4.3)
#  magrittr      2.0.3   2022-03-30 [1] RSPM (R 4.4.3)
#  pillar        1.10.1  2025-01-07 [1] RSPM (R 4.4.0)
#  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.4.3)
#  prettyunits   1.2.0   2023-09-24 [1] RSPM (R 4.4.3)
#  progress      1.2.3   2023-12-06 [1] RSPM (R 4.4.0)
#  R6            2.6.0   2025-02-12 [1] RSPM (R 4.4.0)
#  rlang         1.1.5   2025-01-17 [1] RSPM (R 4.4.0)
#  tibble        3.2.1   2023-03-20 [1] RSPM (R 4.4.3)
#  tidyselect    1.2.1   2024-03-11 [1] RSPM (R 4.4.3)
#  tzdb          0.4.0   2023-05-12 [1] RSPM (R 4.4.0)
#  utf8          1.2.4   2023-10-22 [1] RSPM (R 4.4.3)
#  vctrs         0.6.5   2023-12-01 [1] RSPM (R 4.4.3)
#  vroom         1.6.5   2023-12-05 [1] RSPM (R 4.4.0)
#  withr         3.0.2   2024-10-28 [1] RSPM (R 4.4.3)
#  [1] /Users/r2/Library/R/4.4
#  [2] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
# ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────

(Other users on SO reported the same behavior on this link. I'm posting here as a courtesy.)

Apr 23 '25 00:04 r2evans

It is not unique to vroom: some sleuthing (by margusl) traced it to the internal use of gzcon on a url(.) or curl::curl(.) connection, both of which return curtailed data.
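
For anyone who wants to check this outside vroom, the two read paths can be compared in base R. This is only a minimal sketch, not vroom's actual code: `count_gz_lines()` is a hypothetical helper name introduced here, and it assumes the remote file is gzip-compressed text.

```r
# Hypothetical helper: count the lines that can be read through gzcon()
# wrapped around any connection, e.g. url(...) or file(...).
count_gz_lines <- function(con) {
  z <- gzcon(con)                 # wrap the connection in a gzip decompressor
  on.exit(close(z), add = TRUE)   # closing the wrapper also closes `con`
  length(readLines(z, warn = FALSE))
}

# Usage (needs network access, so it is left commented out):
# u <- "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_station/USW00014839.csv.gz"
# count_gz_lines(url(u))          # streamed: gzcon() on a url() connection
# f <- tempfile(fileext = ".csv.gz")
# download.file(u, f, mode = "wb")
# length(readLines(f))            # local copy; file() auto-detects gzip
```

If the streamed count comes out smaller than the local one, the truncation happens in the connection layer rather than in vroom's parser.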

I'm keeping this issue open in case there is an intelligent way to work around or preempt this problem on known connection types.
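
Until then, one possible user-side workaround is to do what the download.file call at the top of this report does: fetch the complete file first and read the local copy. A minimal sketch, assuming a helper named `read_via_tempfile()` (hypothetical, not part of vroom) and a `reader` argument that accepts a file path:

```r
# Hypothetical helper: download the whole file, then read the local copy
# instead of streaming it through a url() connection.
read_via_tempfile <- function(url, reader = utils::read.csv, ...) {
  # Keep the original file name at the end of the temp path so that readers
  # which look at the extension still see ".csv.gz".
  tmp <- tempfile(fileext = paste0("-", basename(url)))
  # mode = "wb" keeps the gzip bytes intact on Windows.
  utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
  # The temp file is deliberately left in place (R removes its session temp
  # directory on exit) in case the reader accesses the file lazily.
  reader(tmp, ...)
}

# Usage (needs network access, so it is left commented out):
# u <- "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_station/USW00014839.csv.gz"
# dat <- read_via_tempfile(u, reader = vroom::vroom, col_names = FALSE)
```

This costs an extra copy on disk, but it avoids the streamed-decompression path entirely.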

Apr 24 '25 01:04 r2evans