readxl
readxl copied to clipboard
Add column specification as in readr
It would be nice to have the same column guessing with the same syntax as in readr. https://github.com/hadley/readr/releases/tag/v1.0.0
A specific request that appears in #81, at the very least, is to make it easy to set one column type for all columns. Done in readr via, e.g., cols(.default = "c")
.
Updated: now implemented within the existing readxl style, i.e. if col_types
has length one it is recycled for all columns.
It would be nice to specify col type for certain columns and allow others to be guessed.
Update: now implemented within the existing readxl style.
It would be nice to know whether a col type was guessed or specified by user. Conceivably we would warn less (and maybe even coerce more aggressively?) if the type was requested vs. guessed.
I've edited the title to reflect my current thinking: it would be good to align with readr re: explicit column specification (including skipping and guessing, which is already supported). But I don't ever see doing readr-style column guessing. readr guesses based on data, which it must. readxl guesses based on Excel types, which -- I argue -- it must.
Setting aside the guessing of formats, it would be good to at least align the syntax with readr. At the moment read_excel uses col_types "text" for example, whereas readr functions refer to "character".
I asked in #571 to be able to supply a long list of named col_type and use it when the name of the column matches a name on the col_type argument. If I understood correctly readr col_type argument if a column name is not present it doesn't try to guess the type of the column.
library("readr")
mtcars_csv <- system.file("extdata", "mtcars.csv", package="readr")
col_types_list <- cols_only(
mpg = col_double(),
cyl = col_integer(),
disp = col_double()
)
df1 <- read_csv(mtcars_csv, col_types=col_types_list)
ncol(df1) == 3
df2 <- read_csv(mtcars_csv)
ncol(df2) == 11
I think that readxl should try to guess the column types if the type is not provided. Thanks!
The ability to specify column types by name is absolutely key for large data sets with many columns.
In my case, I'm importing internal enterprise data reports. Read_excel is currently mangling the date-time objects in it, and the ability to go back and tweak just that column would be absolutely invaluable. Otherwise, I'm having to go back and hand-define hundreds of columns just to redefine one.
This is on the roadmap for my next several months. The plan is to do a major upgrade of the col spec here at the same time as adding similar functionality to googlesheets4.
@llrs
If I understood correctly readr col_type argument if a column name is not present it doesn't try to guess the type of the column.
This depends on whether you specify the column types list (or col_spec
) with cols_only()
or cols()
.
- If you specify it with
cols_only()
, like you did in your example, columns that are not present in your specification are skipped (cols_only()
implicitly sets.default = col_skip()
). - If you specify the
col_spec
withcols()
, columns that are not present in your specification are guessed (cols()
implicitly sets.default = col_guess()
).
See also https://readr.tidyverse.org/reference/cols.html
Thanks for the clarification @dpprdan
Any updates?
https://github.com/tidyverse/tidyverse.org/pull/569
What's coming next? I won't go so far as to promise that 2022 is the year of readxl 😉. But I can say that top priorities include equipping readxl with better problem reporting and column specification, making its interface feel more similar to that of readr and vroom.
As of version 1.4.2, the docs say col_types
takes only a character vector. Is there a possibility to take a named list, guessing for the unspecified columns?
Until this is not implemented, a workaround I am adopting is using the functions below:
define_readxl_types <- function(col_nms, vec_names, vec_types, default_type = "guess") {
#get names from earlier readxl without types
cols <- dplyr::tibble(
name = col_nms,
index = seq(1:length(col_nms)),
type = default_type
)
#build names and types
col_def <- dplyr::tibble(
name = vec_names,
type = vec_types
)
#get indexes
col_def$index <- purrr::map_vec(col_def$name, ~match(.x, cols$name))
#iterate defined types
for (i in 1:length(col_def)) {
cols$type[col_def$index[i]] <- col_def$type[i]
}
#return types vector
return(cols$type)
}
Probably this is not the best solution, best implementation. It works, though.
Example using readxl
example:
> original <- read_excel(readxl_example("deaths.xlsx"), skip = 4)
> str(original)
tibble [14 × 6] (S3: tbl_df/tbl/data.frame)
$ Name : chr [1:14] "David Bowie" "Carrie Fisher" "Chuck Berry" "Bill Paxton" ...
$ Profession : chr [1:14] "musician" "actor" "musician" "actor" ...
$ Age : chr [1:14] "69" "60" "90" "61" ...
$ Has kids : chr [1:14] "TRUE" "TRUE" "TRUE" "TRUE" ...
$ Date of birth: POSIXct[1:14], format: "1947-01-08" "1956-10-21" "1926-10-18" "1955-05-17" ...
$ Date of death: chr [1:14] "42379" "42731" "42812" "42791" ...
> solution <- read_excel_with_some_types(
+ readxl_example("deaths.xlsx"), skip = 4,
+ vec_names = c("Age", "Has kids"), vec_types = c("numeric", "logical") )
Warning messages:
1: Expecting numeric in C18 / R18C3: got 'at the'
2: Expecting logical in D18 / R18C4: got 'bottom,'
> str(solution)
tibble [14 × 6] (S3: tbl_df/tbl/data.frame)
$ Name : chr [1:14] "David Bowie" "Carrie Fisher" "Chuck Berry" "Bill Paxton" ...
$ Profession : chr [1:14] "musician" "actor" "musician" "actor" ...
$ Age : num [1:14] 69 60 90 61 57 69 82 89 99 53 ...
$ Has kids : logi [1:14] TRUE TRUE TRUE TRUE TRUE FALSE ...
$ Date of birth: POSIXct[1:14], format: "1947-01-08" "1956-10-21" "1926-10-18" "1955-05-17" ...
$ Date of death: chr [1:14] "42379" "42731" "42812" "42791" ...
Didn't tested it extensively, give it a try.
@diegomsg if you want the interface to be consistent with readr, the function should take a named list as col guesses
Echoing @nickbond, I agree on having the syntax match readr. I don't have opinions on which way, but given the same argument name col_types
(and Jenny's affiliation), I expected allowable values for col_type
to be the same.