
Too many delimiters in a row cause the final column to be "overstuffed" with values

Open TMRHarrison opened this issue 3 years ago • 1 comments

Vroom doesn't fail, stop, or raise any errors when a file has a row with more columns than expected. Instead, any remaining values (separator and all) are forced into the final column of the output. A warning is given, but it's cryptic.

Take this tsv file:

num     chr     num2    num3
1       charab  123     434
2       charact         345     2345
3               chaaa   3123    1231

The call I used and its result:

> vroom::vroom("test.tsv")
Rows: 3 Columns: 4
── Column specification ──────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): chr, num2, num3
dbl (1): num

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 4
    num chr     num2  num3
  <dbl> <chr>   <chr> <chr>
1     1 charab  123   "434"
2     2 charact NA    "345\t2345"
3     3 NA      chaaa "3123\t1231"
Warning message:
One or more parsing issues, see `problems()` for details
> problems()
Error in vroom_materialize(x, replace = FALSE) :
  argument "x" is missing, with no default

The output I would expect is a more descriptive error, like data.table::fread() gives:

   num    chr num2 num3
1:   1 charab  123  434
Warning message:
In data.table::fread("test.tsv") :
  Stopped early on line 3. Expected 4 fields but found 5. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<2	charact		345	2345>>

I would also expect vroom to either raise an error, discard the offending rows, or stop reading after the first offending row.
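Until something like that exists, one workaround is to check the field count of each line before handing the file to vroom. The following is a minimal base R sketch, assuming a tab-separated file with no quoted fields; `count_fields` is a hypothetical helper written for this example, not a vroom function, and the file contents are reconstructed from the example above.

```r
# Hypothetical helper (not part of vroom): count delimiter-separated fields
# on each line. Caveat: this naive count ignores quoting, so a delimiter
# inside a quoted field would be counted as a separator.
count_fields <- function(lines, delim = "\t") {
  hits <- gregexpr(delim, lines, fixed = TRUE)
  # Number of delimiters on a line, plus one, is the number of fields.
  lengths(regmatches(lines, hits)) + 1L
}

# Recreate the example file from this issue in a temporary location.
path <- tempfile(fileext = ".tsv")
writeLines(
  c(
    "num\tchr\tnum2\tnum3",
    "1\tcharab\t123\t434",
    "2\tcharact\t\t345\t2345",
    "3\t\tchaaa\t3123\t1231"
  ),
  path
)

lines <- readLines(path)
n_fields <- count_fields(lines)
expected <- n_fields[1]             # the header defines the expected width
bad <- which(n_fields != expected)  # file line numbers (the header is line 1)

# Option 1: fail loudly instead of reading a malformed file.
# if (length(bad) > 0) stop("Unexpected field count on line(s): ", toString(bad))

# Option 2: drop the offending lines and read only the well-formed ones.
clean_path <- tempfile(fileext = ".tsv")
writeLines(lines[n_fields == expected], clean_path)

if (requireNamespace("vroom", quietly = TRUE)) {
  df <- vroom::vroom(clean_path, delim = "\t", show_col_types = FALSE)
  print(df)
}
```

For the example file, `bad` contains lines 3 and 4, which matches the line where `data.table::fread()` reports stopping early. The extra pass over the file costs some of vroom's speed advantage, so this is only a stopgap for files that are known to be messy.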

TMRHarrison avatar May 12 '22 18:05 TMRHarrison

After reading up on problems() and passing it the object returned by vroom() as an argument, I find the error descriptions sufficient, though a bit obfuscated for my taste. It would still be nice to have the offending rows dealt with in some other way than forcing all values into the last column.
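For anyone else who hits the `argument "x" is missing` error, the sketch below shows the pattern that worked: keep the result of `vroom()` and pass it to `problems()`. It also wraps the two calls so that any recorded problem becomes an error rather than a warning. `read_strict` is a hypothetical wrapper written for this example, not part of vroom, and the file contents are reconstructed from the example in this issue.

```r
# Recreate the example file from this issue in a temporary location.
path <- tempfile(fileext = ".tsv")
writeLines(
  c(
    "num\tchr\tnum2\tnum3",
    "1\tcharab\t123\t434",
    "2\tcharact\t\t345\t2345",
    "3\t\tchaaa\t3123\t1231"
  ),
  path
)

# Hypothetical wrapper (not part of vroom): read the file, then stop if vroom
# recorded any parsing problems instead of only warning about them.
read_strict <- function(file, ...) {
  df <- vroom::vroom(file, ..., show_col_types = FALSE)
  # problems() is given the object returned by vroom() as its argument.
  probs <- vroom::problems(df)
  if (nrow(probs) > 0) {
    stop(nrow(probs), " parsing problem(s) found in ", file, call. = FALSE)
  }
  df
}

# Capture the error so the example can show it without aborting the script.
result <- tryCatch(read_strict(path, delim = "\t"), error = function(e) e)
print(result)
```

This turns the "overstuffed" rows into a hard failure, which is closer to the behaviour requested above, but it does not discard or repair the rows.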

TMRHarrison avatar May 13 '22 16:05 TMRHarrison

Your data would actually be read correctly by readr::read_table(), which handles whitespace-delimited files with any number of whitespace characters between columns. Unfortunately, we are not currently planning to replicate this feature in vroom (see https://github.com/tidyverse/vroom/issues/186).

text <- glue::glue(
'x\ty\tz\n
1\t2\t\t3\n
4\t\t5\t6\n')

tf <- withr::local_tempfile(lines = text)

# read_table() handles this messy data
readr::read_table(tf, show_col_types = FALSE)
#> # A tibble: 2 × 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
#> 2     4     5     6

Created on 2022-08-26 by the reprex package (v2.0.1.9000)

sbearrows avatar Aug 26 '22 23:08 sbearrows