readxl icon indicating copy to clipboard operation
readxl copied to clipboard

If no col name, fallback to Excel's letter-based column names

Open N1h1l1sT opened this issue 5 years ago • 1 comments

When an excel file doesn't have Column Names, the assigned colnames of the df being the same as that of the excel file. That would also mean that if we're starting from E5 cell, then the returned dataframe's first column name should also be "e". This way, for instance, when multiple people work on the same excel from a different angle (1 directly and 1 on R), they can easily refer to columns.

Yo, Gianni... remove column F


What happens at the moment

readxl::read_excel(
  "SpreadSheet.xlsx",
  range = cell_limits(c(2, 5), c(NA, NA))
) %>%
  select(-X__2)
x__1 x__3
2 2
3 3

What I suggest:

readxl::read_excel(
  "SpreadSheet.xlsx",
  col_names = FALSE,
  use_excel_col_names = TRUE,
  range = cell_limits(c(2, 5), c(NA, NA))
) %>%
  select(-f)

or

readxl::read_excel(
  "SpreadSheet.xlsx",
  col_names = "excel",
  range = cell_limits(c(2, 5), c(NA, NA))
) %>%
  select(-f)
e g
2 2
3 3

Much easier to understand which columns you're looking at and compare with the excel file.

N1h1l1sT avatar Nov 24 '18 19:11 N1h1l1sT

A lightweight solution for this in "user space" is to pass a custom function to the new .name_repair argument in the dev version of readxl.

devtools::load_all(".")
#> Loading readxl

## readxl has a test sheet with missing column names
## this is an unexported helper that makes test sheets available
read_excel(test_sheet("unnamed-duplicated-columns.xlsx"))
#> New names:
#> * `` -> `..1`
#> * var2 -> var2..2
#> * `` -> `..3`
#> * var2 -> var2..4
#> # A tibble: 2 x 4
#>     ..1 var2..2   ..3 var2..4
#>   <dbl> <chr>   <dbl> <chr>  
#> 1     1 a         1.1 aa     
#> 2     2 b         2.1 bb

LETTER_names <- function(nms) LETTERS[seq_along(nms)]

read_excel(
  test_sheet("unnamed-duplicated-columns.xlsx"),
  .name_repair = LETTER_names
)  
#> # A tibble: 2 x 4
#>       A B         C D    
#>   <dbl> <chr> <dbl> <chr>
#> 1     1 a       1.1 aa   
#> 2     2 b       2.1 bb

## same, but via anonymous function
read_excel(
  test_sheet("unnamed-duplicated-columns.xlsx"),
  .name_repair = ~ LETTERS[seq_along(.x)]
)  
#> # A tibble: 2 x 4
#>       A B         C D    
#>   <dbl> <chr> <dbl> <chr>
#> 1     1 a       1.1 aa   
#> 2     2 b       2.1 bb

Created on 2018-12-13 by the reprex package (v0.2.1.9000)

The bummer is that, in user space, you don't know where the data was found in the sheet, i.e. if empty columns were skipped. I'll have to ponder what would be necessary to do this properly and whether it feels worth it.

jennybc avatar Dec 13 '18 21:12 jennybc