send `spec_*()` to vroom
Fixes #1387
Previously, `spec_*()` was routed to readr's first edition even when the second edition was active. As a result, duplicate name repair was not consistent between `spec_*()` and the equivalent `read_*()` function.
I've only implemented this for `spec_csv()` so far, so we can iron things out first.
Carrying over a conversation from #1431 about guessing behavior in `read_*()` vs `spec_*()`.
One consequence of doing the proposal from #1387 is the following:
Column type guessing never uses rows that will not be read, so if we complete this PR as intended, with `n_max = guess_max`, guessing for `spec_*()` will never be done by sampling rows "interspersed throughout the file". For example, if some data has 2000 rows and `guess_max = 1000`, then `n_max = 1000` and the guess is based on the first 1000 rows only. It will always use all of the data it is given to guess. Here is an example using `challenge.csv`, where the column `y` is filled with `NA` until row 1001, after which it is filled with dates.
```r
# after we complete #1436
spec_csv(readr_example("challenge.csv"), guess_max = 1000) # n_max = 1000
#> cols(
#>   x = col_double(),
#>   y = col_logical()
#> )

# increasing guess_max does change the spec
spec_csv(readr_example("challenge.csv"), guess_max = 1001) # n_max = 1001
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )

# compared to how guessing works in read_csv
# since it's dispersed throughout the file, it guesses correctly
spec(read_csv(readr_example("challenge.csv"), guess_max = 1000, show_col_types = FALSE)) # n_max = all the data
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )
```
Not necessarily a non-starter, but food for thought.
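For anyone who wants to try the proposed behavior before it lands, here is a rough sketch that emulates it with `read_csv()`. `spec_first_rows()` is a hypothetical helper, not part of readr; it assumes that setting `n_max = guess_max` is all the proposal amounts to, and it pays the cost of actually parsing those rows.

```r
library(readr)

# Hypothetical helper (not part of readr): emulate the proposed spec_*()
# behaviour by reading only the first `guess_max` rows and guessing
# column types from exactly those rows.
spec_first_rows <- function(file, guess_max = 1000) {
  df <- read_csv(
    file,
    n_max = guess_max,
    guess_max = guess_max,
    show_col_types = FALSE
  )
  spec(df)
}

path <- readr_example("challenge.csv")

# Only the leading rows are seen, where y is entirely NA
spec_first_rows(path, guess_max = 1000)

# One more row lets the guesser see the first non-missing value in y
spec_first_rows(path, guess_max = 1001)
```

If the assumption holds, the two calls should reproduce the first two specs in the example above.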
More observations:
- `spec_*()` probably just has to be documented as being considerably less useful in readr 2e vs 1e. The 1e design made it possible to visit "cells" for guessing without parsing them into a receptacle, whereas 2e does not. So we're going to be making some compromises. I still suspect we should set `n_max` to `guess_max` and that `guess_max` should default to 1000. And document the problem with that. And the user can always set `guess_max` to something higher or to `Inf`.
- I think (hope?) that the big win here is having the result of `spec_*()` and the result of `read_*()` be consistent with respect to column names.
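To illustrate the escape hatch mentioned in the first bullet, here is a minimal sketch of guessing from every row and then reusing the resulting spec. It goes through `read_csv()` rather than `spec_csv()`, since the `spec_*()` behavior is what this PR is still settling; the variable names are only illustrative.

```r
library(readr)

path <- readr_example("challenge.csv")

# Guess column types from all rows instead of only the first 1000.
# This reads the whole file, so it trades speed for a reliable guess.
full_spec <- spec(read_csv(path, guess_max = Inf, show_col_types = FALSE))
full_spec

# Reuse the spec so later reads skip guessing entirely
df <- read_csv(path, col_types = full_spec)
```

The trade-off is the one described above: a larger `guess_max` means more of the file is parsed just to produce the spec.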