readr
Ignore missing/duplicate names if column is skipped
When I read a file with trailing delimiters, read_csv spits out a warning that a missing column name was filled in. Is there a way to tell the function that I want to read in all but the last (empty) column so that the warning message is not produced? I don't know how common such (malformed) CSV files are, but an option to ignore trailing delimiters might be useful. I tried to get it to work with the col_types argument, but it seems like all columns are read in at first. See also my question on StackOverflow.
You can get the result you want by explicitly skipping that column. Here is one way, but there are some others, such as using cols_only(). Apparently you still get the warning. Perhaps that should be phrased differently, because you have declared your desire to skip this last variable.
library(readr)
read_csv("X1,X2,\nhi,there,\n", col_types = "cc_")
#> Warning: Missing column names filled in: 'X3' [3]
#> # A tibble: 1 × 2
#> X1 X2
#> <chr> <chr>
#> 1 hi there
Hm. It seems like skipping columns always occurs after all data has been read, which is why the warning makes sense if you know how read_csv works. If you naively assume that skipped columns do not influence the result, it seems a bit odd to see this warning.
The same thing is true for skipping columns in arbitrary positions if they don't have values at all, for example:
library(readr)
read_csv("X1,,X2\n1,,2\n3,,4", col_types = "i_i")
This results in two warnings: first the missing column name is automatically filled in as X2, and then the existing X2 column gets renamed to X2_1 to avoid a duplicate name. I guess this is not what most users would expect. Of course, this can be solved by explicitly specifying column names like this:
read_csv("X1,,X2\n1,,2\n3,,4", col_types = "i_i", col_names = c("X1", "X2"), skip = 1)
Considering the behavior above, I expected to have to supply three column names, but that doesn't work: only the names of the columns that are actually kept need to be specified.
I'm really just starting to use readr, so it might be my lack of experience. But maybe all of this could be fixed by having an option to ignore consecutive delimiters (which might include a trailing/leading delimiter as a special case). Or maybe people should try to format their CSVs properly before loading and this is not within readr's scope at all, I don't know.
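For the "format the CSV properly before loading" route, a rough sketch of what that could look like is below. It is a workaround under stated assumptions, not a readr feature: the input lines are inlined for illustration, and base R's read.csv() is used only so the example runs without extra packages.

```r
# Raw lines as they might come from readLines() on the file; inlined here
# so the sketch is self-contained.
lines <- c("X1,X2,", "hi,there,")

# Strip one trailing comma (plus any trailing whitespace) from each line.
cleaned <- sub(",[[:space:]]*$", "", lines)

# Parse the cleaned text. read.csv() accepts a character vector via `text`.
df <- read.csv(text = cleaned, stringsAsFactors = FALSE)
names(df)
#> [1] "X1" "X2"
```

Note that this blindly removes the last delimiter on every line, so a row whose final real field is legitimately empty would lose that field. It is only safe when every line is known to carry the extra delimiter.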
I think this is a problem to do with automatically renaming columns that are then skipped. An option to skip consecutive delimiters seems dangerous to me.
library(readr)
read_csv("X1,\nhi", col_types = "c_")
#> Warning: Missing column names filled in: 'X2' [2]
#> Warning: 1 parsing failure.
#> row col expected actual
#> 1 -- 2 columns 1 columns
#> # A tibble: 1 × 1
#> X1
#> <chr>
#> 1 hi
read_csv("X2,\nhi", col_types = "c_")
#> Warning: Missing column names filled in: 'X2' [2]
#> Warning: Duplicated column names deduplicated: 'X2' => 'X2_1' [2]
#> Warning: 1 parsing failure.
#> row col expected actual
#> 1 -- 2 columns 1 columns
#> # A tibble: 1 × 1
#> X2
#> <chr>
#> 1 hi
There is a bit of a chicken-and-egg problem here: standardising column types needs the column names sorted out first, but the column names themselves depend on columns that end up being skipped ☹️.
It can be done I am sure, but will likely take some refactoring of col_spec_standardise.
FWIW I have the same problem in readxl. Also unsolved. We should talk/commiserate about this @jimhester, to harmonize the solutions as much as possible.
I just stumbled over this issue again. I'm reading a CSV file with an extra delimiter at the end of each line (so read_csv spits out a warning "Missing column names filled in: 'X56' [56]"). This happens even though I'm passing col_types=cols_only(...), where I only specify a subset of column names.
Short example:
read_csv("X1,X2,\nhi,there,\n",
         col_types = cols_only(X1 = col_character(),
                               X2 = col_character()))
Since I explicitly state which columns I want to load, the warning is a bit irritating. Would it be possible to not issue the warning if I haven't explicitly selected it? Otherwise, wrapping everything in withCallingHandlers and suppressing that specific warning gets really unreadable:
withCallingHandlers({
  read_csv("X1,X2,\nhi,there,\n",
           col_types = cols_only(X1 = col_character(),
                                 X2 = col_character()))
},
warning = function(w) {
  # Muffle only the "Missing column names" warning; let all others through.
  if (startsWith(conditionMessage(w), "Missing column names")) {
    invokeRestart("muffleWarning")
  }
})
Or maybe read_csv could have a suppressWarnings argument that also accepts a regex to suppress specific warnings I know I'm going to ignore?
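Until something like that exists, the handler can at least be wrapped once and reused. The sketch below uses only base R; `muffle_warnings` and its `pattern` argument are hypothetical names made up for illustration and are not part of readr.

```r
# Hypothetical helper (not part of readr): evaluate `expr` and muffle only
# those warnings whose message matches the regular expression `pattern`.
muffle_warnings <- function(expr, pattern) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl(pattern, conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Stand-in for a read_csv() call that emits the warning in question.
result <- muffle_warnings(
  {
    warning("Missing column names filled in: 'X3' [3]")
    "done"
  },
  pattern = "^Missing column names"
)
```

With readr loaded, the call would presumably look like `muffle_warnings(read_csv(file, col_types = "cc_"), "^Missing column names")`. Warnings that do not match the pattern are still reported as usual.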
I've got the same problem with read_delim:
N_CSV <- read_delim("~/Documents/RFolder/BOGEN/Bogen_Kapt3/Data/NormalTusindSepDel.csv", delim = ",") %>% slice(-1)
Parsed with column specification:
cols(
  i = col_double(),
  t = col_character(),
  TRH = col_character(),
  pTSH = col_character(),
  TSH = col_character(),
  TT4 = col_character(),
  FT4 = col_character(),
  TT3 = col_character(),
  FT3 = col_character(),
  cT3 = col_character(),
  X11 = col_logical()
)
Warning message:
Missing column names filled in: 'X11' [11]
I just saw that readr::read_delim() has a skip_empty_rows parameter. Maybe introducing a skip_empty_cols parameter would make sense?
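To be clear, skip_empty_cols is only a suggestion here and not an existing readr argument. A rough approximation of its effect can be applied after reading, as sketched below with a hypothetical helper `drop_empty_cols` and a hand-built data frame standing in for the parsed file.

```r
# Hypothetical helper (not a readr function): drop every column in which
# all values are NA, such as the placeholder column that a trailing
# delimiter produces.
drop_empty_cols <- function(df) {
  keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
  df[, keep, drop = FALSE]
}

# Stand-in for the result of reading "X1,X2,\nhi,there,\n".
df <- data.frame(X1 = "hi", X2 = "there", X3 = NA, stringsAsFactors = FALSE)
names(drop_empty_cols(df))
#> [1] "X1" "X2"
```

This does not silence the warning, since the column is still parsed and named first, and it would also drop a genuine column that happens to contain only missing values.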