readxl icon indicating copy to clipboard operation
readxl copied to clipboard

Add column specification as in readr

Open dpprdan opened this issue 8 years ago • 16 comments

It would be nice to have the same column guessing with the same syntax as in readr. https://github.com/hadley/readr/releases/tag/v1.0.0

dpprdan avatar Aug 25 '16 12:08 dpprdan

A specific request that appears in #81, at the very least, is to make it easy to set one column type for all columns. Done in readr via, e.g., cols(.default = "c").

Updated: now implemented within the existing readxl style, i.e. if col_types has length one it is recycled for all columns.

jennybc avatar Feb 05 '17 05:02 jennybc

It would be nice to specify col type for certain columns and allow others to be guessed.

Update: now implemented within the existing readxl style.

jennybc avatar Feb 17 '17 14:02 jennybc

It would be nice to know whether a col type was guessed or specified by user. Conceivably we would warn less (and maybe even coerce more aggressively?) if the type was requested vs. guessed.

jennybc avatar Mar 04 '17 17:03 jennybc

I've edited the title to reflect my current thinking: it would be good to align with readr re: explicit column specification (including skipping and guessing, which is already supported). But I don't ever see doing readr-style column guessing. readr guesses based on data, which it must. readxl guesses based on Excel types, which -- I argue -- it must.

jennybc avatar Oct 27 '17 20:10 jennybc

Setting aside the guessing of formats, it would be good to at least align the syntax with readr. At the moment read_excel uses col_types "text" for example, whereas readr functions refer to "character".

nickbond avatar Sep 10 '18 01:09 nickbond

I asked in #571 to be able to supply a long list of named col_type and use it when the name of the column matches a name on the col_type argument. If I understood correctly readr col_type argument if a column name is not present it doesn't try to guess the type of the column.

library("readr")
mtcars_csv <- system.file("extdata", "mtcars.csv", package="readr")
col_types_list <- cols_only(
    mpg = col_double(),
    cyl = col_integer(),
    disp = col_double()
)
df1 <- read_csv(mtcars_csv, col_types=col_types_list)
ncol(df1) == 3
df2 <- read_csv(mtcars_csv)
ncol(df2) == 11

I think that readxl should try to guess the column types if the type is not provided. Thanks!

llrs avatar May 25 '19 07:05 llrs

The ability to specify column types by name is absolutely key for large data sets with many columns.

In my case, I'm importing internal enterprise data reports. Read_excel is currently mangling the date-time objects in it, and the ability to go back and tweak just that column would be absolutely invaluable. Otherwise, I'm having to go back and hand-define hundreds of columns just to redefine one.

TreyRoady avatar Nov 04 '19 23:11 TreyRoady

This is on the roadmap for my next several months. The plan is to do a major upgrade of the col spec here at the same time as adding similar functionality to googlesheets4.

jennybc avatar Nov 05 '19 01:11 jennybc

@llrs

If I understood correctly readr col_type argument if a column name is not present it doesn't try to guess the type of the column.

This depends on whether you specify the column types list (or col_spec) with cols_only() or cols().

  1. If you specify it with cols_only(), like you did in your example, columns that are not present in your specification are skipped (cols_only() implicitly sets .default = col_skip()).
  2. If you specify the col_spec with cols(), columns that are not present in your specification are guessed (cols() implicitly sets .default = col_guess()).

See also https://readr.tidyverse.org/reference/cols.html

dpprdan avatar Nov 05 '19 10:11 dpprdan

Thanks for the clarification @dpprdan

llrs avatar Nov 05 '19 10:11 llrs

Any updates?

kristjan-kure avatar Mar 29 '22 20:03 kristjan-kure

https://github.com/tidyverse/tidyverse.org/pull/569

What's coming next? I won't go so far as to promise that 2022 is the year of readxl 😉. But I can say that top priorities include equipping readxl with better problem reporting and column specification, making its interface feel more similar to that of readr and vroom.

jennybc avatar Mar 29 '22 20:03 jennybc

As of version 1.4.2, the docs say col_types takes only a character vector. Is there a possibility to take a named list, guessing for the unspecified columns?

jxu avatar Apr 28 '23 02:04 jxu

Until this is not implemented, a workaround I am adopting is using the functions below:

define_readxl_types <- function(col_nms, vec_names, vec_types, default_type = "guess") {
  #get names from earlier readxl without types
  cols <- dplyr::tibble(
    name = col_nms,
    index = seq(1:length(col_nms)),
    type = default_type
  )
  
  #build names and types
  col_def <- dplyr::tibble(
    name = vec_names,
    type = vec_types
  )
  
  #get indexes
  col_def$index <- purrr::map_vec(col_def$name, ~match(.x, cols$name))
  
  #iterate defined types
  for (i in 1:length(col_def)) {
    cols$type[col_def$index[i]] <- col_def$type[i]
  }
  
  #return types vector
  return(cols$type)
}

Probably this is not the best solution, best implementation. It works, though.

Example using readxl example:

> original <- read_excel(readxl_example("deaths.xlsx"), skip = 4)
> str(original)
tibble [14 × 6] (S3: tbl_df/tbl/data.frame)
 $ Name         : chr [1:14] "David Bowie" "Carrie Fisher" "Chuck Berry" "Bill Paxton" ...
 $ Profession   : chr [1:14] "musician" "actor" "musician" "actor" ...
 $ Age          : chr [1:14] "69" "60" "90" "61" ...
 $ Has kids     : chr [1:14] "TRUE" "TRUE" "TRUE" "TRUE" ...
 $ Date of birth: POSIXct[1:14], format: "1947-01-08" "1956-10-21" "1926-10-18" "1955-05-17" ...
 $ Date of death: chr [1:14] "42379" "42731" "42812" "42791" ...

> solution <- read_excel_with_some_types(
+   readxl_example("deaths.xlsx"), skip = 4,
+   vec_names =  c("Age", "Has kids"), vec_types = c("numeric", "logical") )
Warning messages:
1: Expecting numeric in C18 / R18C3: got 'at the' 
2: Expecting logical in D18 / R18C4: got 'bottom,' 
> str(solution)
tibble [14 × 6] (S3: tbl_df/tbl/data.frame)
 $ Name         : chr [1:14] "David Bowie" "Carrie Fisher" "Chuck Berry" "Bill Paxton" ...
 $ Profession   : chr [1:14] "musician" "actor" "musician" "actor" ...
 $ Age          : num [1:14] 69 60 90 61 57 69 82 89 99 53 ...
 $ Has kids     : logi [1:14] TRUE TRUE TRUE TRUE TRUE FALSE ...
 $ Date of birth: POSIXct[1:14], format: "1947-01-08" "1956-10-21" "1926-10-18" "1955-05-17" ...
 $ Date of death: chr [1:14] "42379" "42731" "42812" "42791" ...

Didn't tested it extensively, give it a try.

diegomsg avatar Aug 11 '23 14:08 diegomsg

@diegomsg if you want the interface to be consistent with readr, the function should take a named list as col guesses

jxu avatar Aug 11 '23 21:08 jxu

Echoing @nickbond, I agree on having the syntax match readr. I don't have opinions on which way, but given the same argument name col_types (and Jenny's affiliation), I expected allowable values for col_type to be the same.

asadow avatar Oct 05 '23 18:10 asadow