
Error when using numeric reference in col_select and id in read_csv

Open usrbinr opened this issue 3 years ago • 2 comments

I get a strange error when I use the id argument in read_csv together with col_select, but only if I refer to the columns by their numeric position, e.g. c(1:3).

Specifically, when you import multiple CSV files with readr::read_csv, set the id argument, and pass col_select a set of column positions (either all of the columns or a non-contiguous subset), the call either throws an error or drops a column.

However, if we remove the id argument and use the same numeric reference, it works.

There are many workarounds for this (e.g. referring to the columns by name); however, I sometimes refer to hundreds of columns, so using numeric positions is very handy.

Please see below for a reproducible example:

library(readr)

# Two small data frames with identical columns
df1 <- data.frame(a = 1:10,
                  b = letters[1:10],
                  c = 1:10,
                  d = 1:10)

df2 <- data.frame(a = 11:20,
                  b = letters[11:20],
                  c = 11:20,
                  d = 11:20)

# Write them out without row names
write.csv(x = df1, "df1.csv", row.names = FALSE)
write.csv(x = df2, "df2.csv", row.names = FALSE)

files <- c("df1.csv", "df2.csv")

# Works: brings in everything
read_csv(files, id = "source")

# Works, but column "d" is missing
read_csv(files, col_select = 1:4, id = "source")

# Does not work: "Error: argument of length 0"; it tries to reference a 5th column, which doesn't exist in the files
read_csv(files, col_select = 1:5, id = "source")

# Does not work: "Error: Can't subset columns that don't exist."
read_csv(files, col_select = c(1, 3:4), id = "source")

# However, this works; note that the id argument is omitted
read_csv(files, col_select = c(1, 3:4))


usrbinr avatar Mar 26 '22 18:03 usrbinr

I wanted to add a related problem to this. I have no problem using col_select with the variable names, e.g.:

data <- read_csv(files,col_select=(starts_with("c"):last_col()),id="source")

However, when I do this, the id argument has no effect. There's no error message, but the id column isn't added. It is only added when I don't use the col_select argument.

data <- read_csv(files,id="source")

melissagwolf avatar Jul 22 '22 07:07 melissagwolf

@melissagwolf For now, the id column needs to be named in col_select to be included in the output, like this:

library(vroom)
library(dplyr)

out1 <- glue::glue("a,b,c,d
                 1,2,3,4")
out2 <- glue::glue("a,b,c,d
                 5,6,7,8")

tf1 <- withr::local_tempfile(fileext = ".csv", lines = out1)
tf2 <- withr::local_tempfile(fileext = ".csv", lines = out2)

vroom(
  c(tf1, tf2),
  id = "source",
  col_select = c("source", "a", "b"),
  show_col_types = FALSE
) %>% mutate(source = basename(source))
#> # A tibble: 2 × 3
#>   source                    a     b
#>   <chr>                 <dbl> <dbl>
#> 1 file16c7378aa3650.csv     1     2
#> 2 file16c736ed3fc13.csv     5     6

Created on 2022-08-11 by the reprex package (v2.0.1.9000)

This is likely to change to allow the behavior you expected, and we are tracking that issue at #416.

sbearrows avatar Aug 11 '22 21:08 sbearrows
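
For the positional case from the original report, one possible workaround is to translate the positions into column names first and then select by name, with the id column included as in the maintainer's example above. The following is only a sketch: `select_by_position()` is a hypothetical helper, not part of readr or vroom, and it assumes that selecting by name sidesteps the positional problem when id is set. That assumption has not been confirmed in this thread.

```r
# Hypothetical helper (not part of readr or vroom): turn column positions
# into column names and prepend the id column name.
select_by_position <- function(file, positions, id = "source") {
  # Read one row with base R just to learn the header names.
  header <- names(read.csv(file, nrows = 1, check.names = FALSE))
  c(id, header[positions])
}

# Demo on a throwaway file with the same a-d layout as the report.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:2, b = c("x", "y"), c = 3:4, d = 5:6),
          tmp, row.names = FALSE)

wanted <- select_by_position(tmp, c(1, 3:4))
print(wanted)

# Assumption: naming the id column in col_select keeps it in the output.
# Guarded so the sketch still runs when readr is not installed.
if (requireNamespace("readr", quietly = TRUE)) {
  res <- readr::read_csv(tmp, id = "source",
                         col_select = tidyselect::all_of(wanted),
                         show_col_types = FALSE)
  print(res)
}
```

This assumes every file shares the header of the first one, so the names only need to be looked up once before passing the full vector of files to read_csv.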