incomplete download from URL
Motivated by https://stackoverflow.com/questions/79587028/read-csv-only-reads-a-fraction-of-rows-from-a-zipped-file-when-reading-from-url
A gzipped file is only partially read when loaded by URL, whereas reading the pre-downloaded file returns every row. Notice 429498 rows when loaded as a local file, 1798 when loaded by URL.
download.file("https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_station/USW00014839.csv.gz", "USW00014839.csv.gz")
# trying URL 'https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_station/USW00014839.csv.gz'
# Content type 'application/gzip' length 1699132 bytes (1.6 MB)
# ==================================================
# downloaded 1.6 MB
vroom::vroom("USW00014839.csv.gz", col_names = FALSE)
# Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
# dat <- vroom(...)
# problems(dat)
# Rows: 429498 Columns: 8
# ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
# Delimiter: ","
# chr (4): X1, X3, X5, X7
# dbl (3): X2, X4, X8
# lgl (1): X6
# ℹ Use `spec()` to retrieve the full column specification for this data.
# ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 429,498 × 8
# X1 X2 X3 X4 X5 X6 X7 X8
# <chr> <dbl> <chr> <dbl> <chr> <lgl> <chr> <dbl>
# 1 USW00014839 19380401 TMAX 22 <NA> NA X NA
# 2 USW00014839 19380402 TMAX 6 <NA> NA X NA
# 3 USW00014839 19380403 TMAX 33 <NA> NA X NA
# 4 USW00014839 19380404 TMAX 33 <NA> NA X NA
# 5 USW00014839 19380405 TMAX 0 <NA> NA X NA
# 6 USW00014839 19380406 TMAX 0 <NA> NA X NA
# 7 USW00014839 19380407 TMAX 17 <NA> NA X NA
# 8 USW00014839 19380408 TMAX 22 <NA> NA X NA
# 9 USW00014839 19380409 TMAX 56 <NA> NA X NA
# 10 USW00014839 19380410 TMAX 117 <NA> NA X NA
# # ℹ 429,488 more rows
# # ℹ Use `print(n = ...)` to see more rows
vroom::vroom("https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_station/USW00014839.csv.gz", col_names = FALSE)
# Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
# dat <- vroom(...)
# problems(dat)
# Rows: 1798 Columns: 8
# ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
# Delimiter: ","
# chr (3): X1, X3, X7
# dbl (2): X2, X4
# lgl (3): X5, X6, X8
# ℹ Use `spec()` to retrieve the full column specification for this data.
# ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 1,798 × 8
# X1 X2 X3 X4 X5 X6 X7 X8
# <chr> <dbl> <chr> <dbl> <lgl> <lgl> <chr> <lgl>
# 1 USW00014839 19380401 TMAX 22 NA NA X NA
# 2 USW00014839 19380402 TMAX 6 NA NA X NA
# 3 USW00014839 19380403 TMAX 33 NA NA X NA
# 4 USW00014839 19380404 TMAX 33 NA NA X NA
# 5 USW00014839 19380405 TMAX 0 NA NA X NA
# 6 USW00014839 19380406 TMAX 0 NA NA X NA
# 7 USW00014839 19380407 TMAX 17 NA NA X NA
# 8 USW00014839 19380408 TMAX 22 NA NA X NA
# 9 USW00014839 19380409 TMAX 56 NA NA X NA
# 10 USW00014839 19380410 TMAX 117 NA NA X NA
# # ℹ 1,788 more rows
# # ℹ Use `print(n = ...)` to see more rows
The problems() suggestion does not help much; it mostly reports a few parsing errors that indicate an incorrect guess of lgl for column 6:
# # A tibble: 3 × 5
# row col expected actual file
# <int> <int> <chr> <chr> <chr>
# 1 15 6 1/0/T/F/TRUE/FALSE I ""
# 2 44 6 1/0/T/F/TRUE/FALSE I ""
# 3 858 6 1/0/T/F/TRUE/FALSE S ""
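The lgl misguess is separate from the truncation, but it can be silenced by declaring column types up front. A minimal sketch, assuming it is acceptable to read every column as character and convert later; the two sample rows are made up for illustration (they are not taken from the NOAA file) and are passed as literal data via I():

```r
library(vroom)

# Two made-up rows in the same 8-field layout; the second has a flag in field 6.
sample_rows <- I(paste0(
  "USW00014839,19380401,TMAX,22,,,X,\n",
  "USW00014839,19380415,TMAX,50,,I,X,\n"
))

# Read every column as character so that no column is guessed as logical.
flags_as_text <- vroom(
  sample_rows,
  delim = ",",
  col_names = FALSE,
  col_types = cols(.default = col_character()),
  show_col_types = FALSE
)
```

This only addresses the parsing warnings; it is not expected to change how many rows come back from the URL.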
I tested at least one other gzipped file on that site (USW00014838.csv.gz), and it shows the same behavior, albeit at a different cut point (6628 of 204830 rows). It does not appear to be a clear networking issue: both download.file (shown above) and data.table::fread (not shown) read in the entire file without any message, warning, or error. The truncated read is also reproducible, returning the same number of rows each time (confirmed by multiple users, likely on multiple OSes, though that is not verified).
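One way to narrow this down without vroom is to count decompressed lines by two base-R routes for the same URL: download first and decompress the local copy, versus decompress while streaming. This is only a sketch. count_lines_two_ways() is a hypothetical helper, and gzcon(url()) is an assumption about a comparable streaming route, not a claim about what vroom does internally. The demonstration runs offline against a small file:// URL, so it shows the mechanics rather than reproducing the NOAA result.

```r
# Count decompressed lines two ways for the same source URL.
# Hypothetical helper; it does not call vroom at all.
count_lines_two_ways <- function(src) {
  # Route A: save the compressed bytes to disk, then decompress the local copy.
  tmp <- tempfile(fileext = ".csv.gz")
  on.exit(unlink(tmp), add = TRUE)
  utils::download.file(src, tmp, mode = "wb", quiet = TRUE)
  local_con <- gzfile(tmp, open = "rt")
  n_local <- length(readLines(local_con))
  close(local_con)

  # Route B: decompress while streaming from the URL connection.
  stream_con <- gzcon(url(src, open = "rb"))
  n_stream <- length(readLines(stream_con))
  close(stream_con)

  c(local = n_local, streamed = n_stream)
}

# Offline demonstration on a small gzipped file addressed by a file:// URL.
demo_path <- tempfile(fileext = ".csv.gz")
demo_con <- gzfile(demo_path, open = "wt")
writeLines(sprintf("row%d,%d", 1:1000, 1:1000), demo_con)
close(demo_con)
demo_url <- paste0("file://", normalizePath(demo_path, winslash = "/"))
demo_counts <- count_lines_two_ways(demo_url)
```

If the two counts differ for the NOAA URL, that would point at the streaming decompression step rather than at the network transfer.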
session_info()
sessioninfo::session_info(pkgs = "vroom")
# ─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────
# setting value
# version R version 4.4.3 (2025-02-28)
# os macOS Sequoia 15.4.1
# system aarch64, darwin20
# ui X11
# language (EN)
# collate en_US.UTF-8
# ctype en_US.UTF-8
# tz America/New_York
# date 2025-04-22
# pandoc 3.6.4 @ /opt/homebrew/opt/pandoc/bin/ (via rmarkdown)
# quarto 1.6.43 @ /usr/local/bin/quarto
# ─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────
# package * version date (UTC) lib source
# bit 4.5.0.1 2024-12-03 [1] RSPM (R 4.4.0)
# bit64 4.6.0-1 2025-01-16 [1] RSPM (R 4.4.0)
# cli 3.6.4 2025-02-13 [1] RSPM (R 4.4.3)
# cpp11 0.5.1 2024-12-04 [1] RSPM (R 4.4.0)
# crayon 1.5.3 2024-06-20 [1] RSPM (R 4.4.3)
# fansi 1.0.6 2023-12-08 [1] RSPM (R 4.4.3)
# glue 1.8.0 2024-09-30 [1] RSPM (R 4.4.3)
# hms 1.1.3 2023-03-21 [1] RSPM (R 4.4.0)
# lifecycle 1.0.4 2023-11-07 [1] RSPM (R 4.4.3)
# magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.4.3)
# pillar 1.10.1 2025-01-07 [1] RSPM (R 4.4.0)
# pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.4.3)
# prettyunits 1.2.0 2023-09-24 [1] RSPM (R 4.4.3)
# progress 1.2.3 2023-12-06 [1] RSPM (R 4.4.0)
# R6 2.6.0 2025-02-12 [1] RSPM (R 4.4.0)
# rlang 1.1.5 2025-01-17 [1] RSPM (R 4.4.0)
# tibble 3.2.1 2023-03-20 [1] RSPM (R 4.4.3)
# tidyselect 1.2.1 2024-03-11 [1] RSPM (R 4.4.3)
# tzdb 0.4.0 2023-05-12 [1] RSPM (R 4.4.0)
# utf8 1.2.4 2023-10-22 [1] RSPM (R 4.4.3)
# vctrs 0.6.5 2023-12-01 [1] RSPM (R 4.4.3)
# vroom 1.6.5 2023-12-05 [1] RSPM (R 4.4.0)
# withr 3.0.2 2024-10-28 [1] RSPM (R 4.4.3)
# [1] /Users/r2/Library/R/4.4
# [2] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
# ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
(Other users on SO reported the same behavior on this link. I'm posting here as a courtesy.)
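In the meantime, a workaround consistent with the local-file result above is to download the file first and read the local copy. A minimal sketch, assuming a hypothetical helper read_remote_gz(); the reader argument defaults to utils::read.csv only so that the example runs without extra packages, and the demonstration uses a small file:// URL instead of the NOAA link.

```r
# Download a gzipped file to a temporary path, then read the local copy.
# Hypothetical helper; `reader` is any function whose first argument is a path.
read_remote_gz <- function(src, reader = utils::read.csv, ...) {
  tmp <- tempfile(fileext = ".csv.gz")
  # mode = "wb" keeps the compressed bytes intact on every platform.
  utils::download.file(src, tmp, mode = "wb", quiet = TRUE)
  # The temporary file is left in tempdir() on purpose, in case the reader
  # is lazy and still needs it after this function returns.
  reader(tmp, ...)
}

# Offline demonstration: write a small gzipped CSV and read it via a file:// URL.
wk_path <- tempfile(fileext = ".csv.gz")
wk_con <- gzfile(wk_path, open = "wt")
writeLines(sprintf("USW%05d,%d,TMAX", 1:500, 1:500), wk_con)
close(wk_con)
wk_url <- paste0("file://", normalizePath(wk_path, winslash = "/"))
wk_data <- read_remote_gz(wk_url, header = FALSE)
```

With vroom installed, read_remote_gz(url, reader = vroom::vroom, col_names = FALSE) should go through the same local-file path that returned all 429,498 rows above.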