readr
overzealous guessing/parsing for the "number" format based on grouping marks?
I'm encountering situations where the guessing rules seem overzealous in choosing the "number" type based on the presence of grouping marks (for me, commas), which in turn leads to data quality problems during import. I wonder whether it would be worth adding a few more checks before a "number" guess is made.
An example, similar to what I actually encountered:
### Can work well:
readr::guess_parser("1,234,567") # Fine - 1234567
#> [1] "number"
readr::guess_parser("0,234,567") # Thoughtful--leading zero inconsistent w/idea of "number"
#> [1] "character"
### But:
readr::guess_parser("1,2,4") # Not a standard number (in my locale)
#> [1] "number"
readr::guess_parser("1,2,") # Farther afield
#> [1] "number"
readr::guess_parser("1,2,,,,,4,,,") # Even farther
#> [1] "number"
### Real-world example--encountering data that uses quotes for comma sequestration,
### including groups of numeric reference codes:
csv_dat <-
c(
'char, num, numeric_codes, mixed',
'a, 1, 1, a',
'"oh,my", 2, 2, 2',
'c, 3, "1,23,4", c',
'd, 4, "1,2,3,4", d'
)
### Write it out:
tmp <- tempfile(fileext = ".csv")
csv_dat |>
stringr::str_remove_all(" ") |>
writeLines(tmp) # writeLines() accepts a file path directly
### numeric_codes read back in as "number" thanks to the commas:
dat <- readr::read_csv(tmp)
#> Rows: 4 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): char, mixed
#> dbl (1): num
#> num (1): numeric_codes
### And info can be lost--e.g., in numeric_codes elements 3 and 4 are now indistinguishable:
dat
#> # A tibble: 4 × 4
#>   char    num numeric_codes mixed
#>   <chr> <dbl>         <dbl> <chr>
#> 1 a         1             1 a
#> 2 oh,my     2             2 2
#> 3 c         3          1234 c
#> 4 d         4          1234 d
Created on 2023-10-24 with reprex v2.0.2
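To make the "few more checks" idea concrete, here is a rough base R sketch of one possible rule: treat a string as a grouped number only when the grouping marks split the digits into a leading group of one to three digits followed by groups of exactly three. The helper `looks_like_grouped_number()` is hypothetical and not part of readr; it assumes three-digit grouping, so it would also reject locales that group differently (e.g. 1,23,456), and it ignores decimals.

```r
# Hypothetical helper, not part of readr: a stricter test for grouped numbers.
# Assumes three-digit grouping and no decimal part.
looks_like_grouped_number <- function(x, grouping_mark = ",") {
  # \Q...\E makes the mark literal, so "." works as a grouping mark too.
  mark <- paste0("\\Q", grouping_mark, "\\E")
  # Optional sign, a leading group of 1-3 digits that does not start with 0,
  # then one or more groups of exactly three digits.
  pattern <- paste0("^[+-]?[1-9][0-9]{0,2}(", mark, "[0-9]{3})+$")
  grepl(pattern, x, perl = TRUE)
}

candidates <- c("1,234,567", "0,234,567", "1,2,4", "1,2,", "1,2,,,,,4,,,", "1,23,4")
grouped <- looks_like_grouped_number(candidates)
grouped
#> [1]  TRUE FALSE FALSE FALSE FALSE FALSE
```

Under a rule like this, only the first string would qualify as a "number"; the others would presumably fall through to "character".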
Of course, one could just explicitly specify the column types, or read everything as character. But I don't always know the complete structure of my data in advance, so while those are options, they aren't optimal. It would be great to be able to lean at least partly on the default guessing to streamline things.
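For reference, the two workarounds mentioned here look roughly like this with the same data as in the reprex. This is a minimal sketch using readr's `col_types` argument with `cols()` and `col_character()`; the object names `dat_one` and `dat_all` are just illustrative.

```r
library(readr)

# Same data as in the reprex, written to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    'char,num,numeric_codes,mixed',
    'a,1,1,a',
    '"oh,my",2,2,2',
    'c,3,"1,23,4",c',
    'd,4,"1,2,3,4",d'
  ),
  tmp
)

# Workaround 1: pin only the problem column; the other columns are still guessed.
dat_one <- read_csv(tmp, col_types = cols(numeric_codes = col_character()))

# Workaround 2: skip guessing entirely and read every column as character.
dat_all <- read_csv(tmp, col_types = cols(.default = col_character()))

# Elements 3 and 4 of numeric_codes now stay distinct.
dat_one$numeric_codes
```

Both work, but both require knowing which columns are at risk (or giving up on guessing for all of them), which is the part I'd like to avoid.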
(PS - riffing a bit, but as an alternative, what if it were possible to specify a subset of possible types of data to guess from? For instance, I may be starting with an unknown mix/arrangement of character, numeric, and logical columns--but I know there aren't going to be any pretty-formatted numbers, factors, or times. That's probably a pretty common scenario, I'd think?)
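One way to approximate that today, as a rough sketch only: read everything as character first, then convert each column with a small helper that considers just the types you expect. The helper `convert_restricted()` below is hypothetical (not a readr function) and uses base R conversions, so its rules differ from readr's guessing; for instance, `as.numeric()` also accepts strings such as "Inf" or hex literals. The `flag` column is added here purely to illustrate the logical case.

```r
# Hypothetical helper, not part of readr: convert a character vector to the
# first type in a restricted set (logical, then double) that fits every
# non-missing value; otherwise leave it as character.
convert_restricted <- function(x) {
  vals <- x[!is.na(x)]
  if (length(vals) == 0) {
    return(x)
  }
  if (all(vals %in% c("TRUE", "FALSE", "T", "F", "true", "false"))) {
    return(as.logical(x))
  }
  nums <- suppressWarnings(as.numeric(vals))
  if (!anyNA(nums)) {
    return(as.numeric(x))
  }
  x
}

# Stand-in for data that was read with every column as character.
raw <- data.frame(
  char = c("a", "oh,my", "c", "d"),
  num = c("1", "2", "3", "4"),
  numeric_codes = c("1", "2", "1,23,4", "1,2,3,4"),
  flag = c("TRUE", "FALSE", "TRUE", "TRUE"), # illustrative column, not in the reprex
  stringsAsFactors = FALSE
)

typed <- raw
typed[] <- lapply(raw, convert_restricted)
str(typed)
```

Because "number" is never a candidate here, `numeric_codes` stays character while `num` becomes double and `flag` becomes logical.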