readr
readr copied to clipboard
Name repair for duplicated columns inconsistent between `read_csv` and `spec_csv`
When trying to create column specifications of a file that contains duplicated variable names, spec_csv() renames the variables differently (e.g "error_1") than read_csv() ("error...1").
Let test.csv be a simple CSV with duplicated variables "error" and one observation:
echo "a,error,b,error,c,error\n1,string1,2,string2,3,string3" >> test.csv
readr::spec_csv("test.csv")
cols(
a = col_double(),
error = col_character(),
b = col_double(),
error_1 = col_character(),
c = col_double(),
error_2 = col_character()
)
Warning message:
Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error' => 'error_2' [6]
readr::read_csv("test.csv")
New names:
* error -> error...2
* error -> error...4
* error -> error...6
Rows: 1 Columns: 6
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): error...2, error...4, error...6
dbl (3): a, b, c
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 6
a error...2 b error...4 c error...6
<dbl> <chr> <dbl> <chr> <dbl> <chr>
1 1 string1 2 string2 3 string3
NB: I dug through some readr code and it seems to me that read_csv and spec_csv are not using the same code-base: spec_csv uses the function col_spec_standardise where duplicated column names are dealt with:
if (anyDuplicated(col_names)) {
dups <- duplicated(col_names)
old_names <- col_names
col_names <- make.unique(col_names, sep = "_")
warning(
"Duplicated column names deduplicated: ",
paste0(
encodeString(old_names[dups], quote = "'"),
" => ",
encodeString(col_names[dups], quote = "'"),
" [", which(dups), "]",
collapse = ", "
),
call. = FALSE
)
}
And read_csv is essentially a wrapper for vroom::vroom which then deals with column names (in the C++ code part I think?).
It seems to me that the code to make unique names for duplicated column names is not correct and should follow the same pattern as vroom/vroom_ or vice versa.
Edit:
read_csv uses vctrs::vec_as_names to create syntactically valid names (in col_types_standardise). Thus spec_csv should use the same call.
Sometimes I know the duplicated columns exisited, but I just wanna ignore it. However, with the current defualt name_repair, looking like "num...1", I cannot ignore it, I have to deal with it.
Note to self:
Right now spec_csv() is going through edition 1 read_csv() which ends up in col_spec_standardise and that has it's own custom name repair machinery https://github.com/tidyverse/readr/blob/85cf1e8e664e2a75eba393f91d89b39060140dab/R/col_types.R#L441-L458
The fact that we force spec_csv() to edition 1 is due to: https://github.com/tidyverse/readr/commit/c88303c8f62478d752ec21381feb80bccebb619d
We don't see a way to allow spec_csv() to use edition 2 because vroom doesn't support guessing without parsing the data (as the above commit mentions).
value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")
# note that edition 1 actually guesses column types
readr::with_edition(1, readr::read_csv(value, n_max = 0, guess_max = 1000))
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> # A tibble: 0 × 6
#> # … with 6 variables: a <dbl>, error <chr>, b <dbl>, error_1 <chr>, c <dbl>,
#> # error_2 <chr>
#> # ℹ Use `colnames()` to see all variable names
# while edition 2 returns all characters
readr::read_csv(value, n_max = 0, guess_max = 1000)
#> New names:
#> Rows: 0 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (6): a, error...2, b, error...4, c, error...6
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> # A tibble: 0 × 6
#> # … with 6 variables: a <chr>, error...2 <chr>, b <chr>, error...4 <chr>,
#> # c <chr>, error...6 <chr>
#> # ℹ Use `colnames()` to see all variable names
Created on 2022-08-23 by the reprex package (v2.0.1.9000)
There are three options for moving forward:
- Bring over the name repair machinery from vroom a la https://github.com/tidyverse/vroom/blob/339073c4fc5cfdcbeb29ff716d41c2aaaf6bad43/R/col_types.R#L389 and add the
name_repairargument to all instances whereread_delimited()is called. This would require a rather large change to the code base but it would mean thename_repairargument would now be respected for edition 1read_csv()(and friends). - Bring over the name repair machinery from vroom as above, but force
name_repair = uniqueand continue to not respectname_repair. It would makespec_csv()andread_csv()perform the same for default name repair and we wouldn't need to change a large part of the code base. We currently don't respectname_repairin edition 1, so continuing this behavior might not be a big deal. - Do nothing because the (potentially large) changes would occur in edition 1, which is feature frozen or the changes wouldn't be a perfect solution due to
name_repairstill not being respected. We also don't feel like name repair for duplicate columns names is a safe or advised maneuver to rely on when copying a spec fromspec_csv()for use inread_csv().
It seems like a stopgap readr 2e / vroom version of spec_csv() could just call read_csv() and then return the spec from that result.
(BTW various things seem wonky with this printed output)
library(readr)
value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")
dat <- read_csv(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
spec(dat)
#> cols(
#> a = col_double(),
#> error...2 = col_character(),
#> b = col_double(),
#> error...4 = col_character(),
#> c = col_double(),
#> error...6 = col_character()
#> )
readr 2e / vroom column names: a, error...2, b, error...4, c, error...6
What happens if we just ask for the spec from the same input?
spec_csv(value)
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> cols(
#> a = col_double(),
#> error = col_character(),
#> b = col_double(),
#> error_1 = col_character(),
#> c = col_double(),
#> error_2 = col_character()
#> )
Problem: we get readr 1e-style column names: a, error, b, error_1, c, error_2
It does seem reasonable to expect spec_csv() to return the same spec as spec(read_csv()), at least with respect to column names.
A simple fix would be to call read_csv() on the input with n_max = guess_max and then return that spec.
spec_csv_vroom <- function(..., guess_max = 1000) {
tmp <- readr::read_csv(..., n_max = guess_max, guess_max = guess_max)
spec(tmp)
}
spec_csv_vroom(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> cols(
#> a = col_double(),
#> error...2 = col_character(),
#> b = col_double(),
#> error...4 = col_character(),
#> c = col_double(),
#> error...6 = col_character()
#> )
Is there a reason not to do something along these lines?
Is there a reason not to do something along these lines?
Performance? I guess? i.e. this commit from Jim talks about vroom not supporting guessing without parsing https://github.com/tidyverse/readr/commit/c88303c8f62478d752ec21381feb80bccebb619d
But since spec_csv() and friends are probably relatively rarely used, the benefits of updating them to 2e probably greatly outweigh performance costs - especially since the long term goal is to remove 1e code
Is there a reason not to do something along these lines?
Same as @DavisVaughan, it would solve the problem especially in the long term for removing readr edition 1 code, but having spec_csv() be just a wrapper for spec(read_csv(n_max = guess_max)) seems odd
Yeah the only thing I could think of was performance (or elegance). But I think returning the wrong column names is a much bigger sin that the downsides of a simple spec_*() implementation.
"readr doesn't have to parse to guess col types but vroom does" seems a bit misleading. readr is definitely consulting (tokenizing and typing) up to guess_max rows of data, in order to guess the column types. This particular point doesn't really resonate with me. I suppose if we were truly only column type guessing, we wouldn't load the data atoms into a data frame, but this seems like an awfully subtle thing to be worried about. Does anyone really "get" this point?
having
spec_csv()be just a wrapper forspec(read_csv(n_max = guess_max))seems odd
How so? It seems like the point of spec_csv() is to return the spec associated with the data frame you'd get if you read that file.
I know that spec_csv() is basically already a wrapper for ed1 spec(read_csv(n_max = 0, guess_max = guess_max)) so I guess it's not the "wrapper" bit that feels odd. Maybe I'm just blocked by what I think spec_csv() should be doing (ie guessing column types but not reading anything).
guessing column types but not reading anything
But you can't guess column types without reading from the file.