
claimed parsing issues don't flow through purrr::map()

Open twest820 opened this issue 10 months ago • 0 comments

Guidance in #1472 is to read heterogeneous data with purrr::map(). However, with at least some datasets, this approach either blocks vroom's mechanisms for reporting parsing issues or results in spurious warnings about parse errors.

This makes it hard to tell whether there actually is a problem somewhere within a set of hundreds of files. It'd therefore be helpful if there were a mechanism to flow the problems through and indicate which files they occurred in.

# download the four quarters of 2018 data from https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data and extract the 365 .csv files
# .csv files for 2013-2017 are homogeneous and don't raise a warning, 2019 is also heterogeneous
library(readr)

# list.files() takes a regular expression, not a glob, so match the .csv extension explicitly
yearFiles = list.files("2018", pattern = "\\.csv$", full.names = TRUE)
yearData = purrr::map(yearFiles, \(file) read_csv(
  file,
  col_types = cols_only(date = "D", serial_number = "c", model = "c", failure = "l"),
  col_select = c("date", "serial_number", "model", "failure")
))
# Warning message:                                                                                                                             
# One or more parsing issues, call `problems()` on your data frame for details, e.g.:
#   dat <- vroom(...)
#   problems(dat) 

problems(yearData) # doesn't print anything, not surprising since yearData's a list
lapply(yearData, problems) # returns only empty tibbles
for (index in seq_along(yearData))
{
  # doesn't print anything for any day of the year
  # (values aren't auto-printed inside a for loop; wrap in print() to display them)
  problems(yearData[[index]])
}

# also produces a warning message but no problems are printed
for (index in seq_along(yearFiles))
{
  dayData = read_csv(
    yearFiles[index],
    col_types = cols_only(date = "D", serial_number = "c", model = "c", failure = "l"),
    col_select = c("date", "serial_number", "model", "failure")
  )
  problems(dayData) # not auto-printed inside a for loop; wrap in print() to display
}

It'd probably be good to also support problem flow through bind_rows(purrr::map()) as I suspect that's a common pattern.
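
To illustrate the kind of output that would be useful, here is a minimal sketch of a per-file wrapper. It is only an assumption about what a workaround could look like, not an existing readr feature: `read_with_problems()` is a hypothetical helper, the two temporary files are made-up stand-ins for the Backblaze data, and `source_file` is an invented column name. It may well still return empty problem tibbles in the cases described above, since it relies on `problems()` working for each individual file.

```r
library(readr)
library(purrr)
library(dplyr)

# Two small stand-in files (assumption: not the real Backblaze data).
good_file <- tempfile(fileext = ".csv")
bad_file <- tempfile(fileext = ".csv")
writeLines(c("date,serial_number", "2018-01-01,A1"), good_file)
writeLines(c("date,serial_number", "not-a-date,B2"), bad_file)
files <- c(good_file, bad_file)

# Hypothetical helper: read one file eagerly and capture its problems right away,
# so they are stored next to the data instead of only surfacing as a warning.
read_with_problems <- function(file, ...) {
  data <- suppressWarnings(read_csv(file, ..., lazy = FALSE, show_col_types = FALSE))
  list(data = data, problems = problems(data))
}

spec <- cols(date = "D", serial_number = "c")
results <- map(set_names(files), \(file) read_with_problems(file, col_types = spec))

# One tibble of problems and one of data, each tagged with the originating file.
all_problems <- results |> map("problems") |> bind_rows(.id = "source_file")
all_data <- results |> map("data") |> bind_rows(.id = "source_file")

print(all_problems)
```

Naming the list with `set_names()` is what lets `bind_rows(.id = ...)` record which file each row came from, which is the piece that is currently lost when mapping over many files.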

twest820 • Feb 07 '25 19:02