Too many delimiters in a row cause the final column to be "overstuffed" with values
Vroom doesn't fail, stop, or raise any errors when a file has a row with more columns than expected. Instead, any remaining values (separator and all) are forced into the final column of the output. A warning is given, but it's cryptic.
Take this tsv file, in which rows 2 and 3 each contain two consecutive tabs and so have five fields instead of four:
num chr num2 num3
1 charab 123 434
2 charact 345 2345
3 chaaa 3123 1231
The call used and the resulting output:
> vroom::vroom("test.tsv")
Rows: 3 Columns: 4
── Column specification ──────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): chr, num2, num3
dbl (1): num
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 4
num chr num2 num3
<dbl> <chr> <chr> <chr>
1 1 charab 123 "434"
2 2 charact NA "345\t2345"
3 3 NA chaaa "3123\t1231"
Warning message:
One or more parsing issues, see `problems()` for details
> problems()
Error in vroom_materialize(x, replace = FALSE) :
argument "x" is missing, with no default
The output I would expect is a more descriptive error, like data.table::fread() gives:
num chr num2 num3
1: 1 charab 123 434
Warning message:
In data.table::fread("test.tsv") :
Stopped early on line 3. Expected 4 fields but found 5. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<2 charact 345 2345>>
I would also expect vroom to either raise an error, discard the offending rows, or stop the read after the first offending row.
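Until something like that exists, the offending rows can be located before the read. This is a minimal sketch using base R's `count.fields()`, not anything vroom does internally; the temporary file and the variable names (`path`, `n_fields`, `bad_lines`) are made up for illustration, and the file contents are reconstructed from the output shown above.

```r
# Recreate the example file from this issue in a temporary location.
# Rows 2 and 3 contain a doubled tab, so they have five fields.
path <- tempfile(fileext = ".tsv")
writeLines(
  c(
    "num\tchr\tnum2\tnum3",
    "1\tcharab\t123\t434",
    "2\tcharact\t\t345\t2345",
    "3\t\tchaaa\t3123\t1231"
  ),
  path
)

# Count tab-separated fields on every line (quoting and comments disabled).
n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "")

# Treat the header as the definition of the expected width.
expected <- n_fields[1]

# Line numbers (1-based, header included) whose width differs from the header.
bad_lines <- which(n_fields != expected)

if (length(bad_lines) > 0) {
  message(
    "Expected ", expected, " fields; mismatched lines: ",
    paste(bad_lines, collapse = ", ")
  )
}
```

From here one could stop with an error, or drop those lines before handing the rest to the reader, depending on which of the three behaviours is wanted.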
After looking into problems() and passing the object returned by vroom() as an argument, I find the error descriptions sufficient, though a bit obfuscated for my tastes. It would still be nice to have the offending rows dealt with in some way other than forcing all values into the last column.
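For reference, the working form of the `problems()` call looks roughly like the sketch below. It assumes the vroom package is installed; the temporary file and the names `path`, `dat`, and `probs` are illustrative, and the file contents are reconstructed from the output shown earlier.

```r
# Assumes the vroom package is installed.
# Recreate the example file; rows 2 and 3 each contain a doubled tab.
path <- tempfile(fileext = ".tsv")
writeLines(
  c(
    "num\tchr\tnum2\tnum3",
    "1\tcharab\t123\t434",
    "2\tcharact\t\t345\t2345",
    "3\t\tchaaa\t3123\t1231"
  ),
  path
)

# Keep the result in a variable so it can be passed to problems() explicitly.
dat <- vroom::vroom(path, delim = "\t", show_col_types = FALSE)

# Passing the data frame avoids relying on the default argument.
probs <- vroom::problems(dat)
print(probs)
```

Storing the result first matters because a bare `problems()` depends on its default argument, which is what failed in the console session above.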
Your data would actually be read correctly with readr::read_table() which handles whitespace delimited files with any number of whitespace characters between columns. Unfortunately, we are not currently pursuing replicating this feature in vroom (see https://github.com/tidyverse/vroom/issues/186).
text <- glue::glue(
'x\ty\tz\n
1\t2\t\t3\n
4\t\t5\t6\n')
tf <- withr::local_tempfile(lines = text)
# read_table() handles this messy data
readr::read_table(tf, show_col_types = FALSE)
#> # A tibble: 2 × 3
#> x y z
#> <dbl> <dbl> <dbl>
#> 1 1 2 3
#> 2 4 5 6
Created on 2022-08-26 by the reprex package (v2.0.1.9000)