Extraneous whitespace complicates parsing in the presence of quoted fields
Based on issues originally reported in readr: https://github.com/tidyverse/readr/issues/1237 https://github.com/tidyverse/readr/issues/1350
readr 1e tolerates gratuitous whitespace when parsing quoted fields, which can be nice for human readability. But readr 2e, i.e. vroom, does not. Do we want to do anything about that?
library(readr)
library(vroom)
with_edition(1,
  read_csv('
"X1", "X2"
"a,b", "c,d"',
    show_col_types = FALSE
  )
)
#> # A tibble: 1 × 2
#> X1 X2
#> <chr> <chr>
#> 1 a,b c,d
with_edition(2,
  read_csv('
"X1", "X2"
"a,b", "c,d"',
    show_col_types = FALSE
  )
)
#> Warning: One or more parsing issues, see `problems()` for details
#> # A tibble: 1 × 2
#> X1 X2
#> <chr> <chr>
#> 1 a "b\", \"c,d\""
vroom(I('
"X1", "X2"
"a,b", "c,d"'),
  show_col_types = FALSE
)
#> Warning: One or more parsing issues, see `problems()` for details
#> # A tibble: 1 × 2
#> X1 X2
#> <chr> <chr>
#> 1 a "b\", \"c,d\""
Created on 2022-01-26 by the reprex package (v2.0.1.9000)
I tried to find out if there are any standards that provide guidance on the right thing to do here.
Of course there is no real standard for CSV files, but this is the de facto one:
https://datatracker.ietf.org/doc/html/rfc4180
It seems to be rather silent on the matter of how whitespace and quoted fields interact.
This Library of Congress page is useful:
https://www.loc.gov/preservation/digital/formats/fdd/fdd000323.shtml
"Several relatively common variations from the strict form specified by RFC 4180 are found and may be supported by software tools ...."
"The treatment of whitespace adjacent to field and record separators varies among applications. If whitespace at the beginning and end of a textual field value is significant, the text string should be text-qualified, i.e. enclosed in quotes."
Still not a definitive answer, but definitely compatible with the position that, if you've got "text-qualified" or quoted fields, you should not have extraneous whitespace in the file, such as adjacent to the delimiters.
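To make "extraneous whitespace adjacent to the delimiters" concrete, here is a minimal base R sketch of a check for it. `has_delim_ws()` is a hypothetical helper, not something readr or vroom provides, and it assumes a comma delimiter and double-quote text qualifier.

```r
# Hypothetical check (not a readr or vroom function): flag lines in which a
# comma delimiter and a double quote are separated only by spaces or tabs.
# Assumption: this is a plain pattern match, not a CSV parser, so a quoted
# field whose own text contains `, "` is flagged as well.
has_delim_ws <- function(lines) {
  grepl(',[ \t]+"|"[ \t]+,', lines)
}

has_delim_ws(c('"a,b","c,d"', '"a,b", "c,d"'))
#> [1] FALSE  TRUE
```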
Consider this file:
x ,y
"a,b", "c,d"
If I upload it to CSV Lint, I see:
Error Structural problem: Unexpected whitespace on row 2
"a,b", "c,d"
Quoted columns in the CSV should not have any leading or trailing whitespace. Remove any spaces, tabs or other whitespace from either side of the delimiters in the row.
http://csvlint.io/validation/61f1d9b415cdf0000400004f
More evidence that would support leaving this as #wontfix.
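If this does stay #wontfix, anyone stuck with such a file could pre-process it before parsing. The following is only a sketch in base R: `strip_delim_ws()` is a hypothetical helper, not part of readr or vroom, and it assumes a comma delimiter and that no quoted field contains the stripped sequences itself.

```r
# Hypothetical helper (not part of readr or vroom): drop spaces and tabs that
# sit between a comma delimiter and the quote that opens or closes a field.
# Assumption: no quoted field itself contains `, "` or `" ,`, since a plain
# regex cannot tell those apart from real delimiters.
strip_delim_ws <- function(lines) {
  lines <- gsub(',[ \t]+"', ',"', lines)  # whitespace before an opening quote
  lines <- gsub('"[ \t]+,', '",', lines)  # whitespace after a closing quote
  lines
}

raw <- c('"X1", "X2"', '"a,b", "c,d"')
cleaned <- strip_delim_ws(raw)
```

The cleaned lines can then be collapsed into a single string and handed to the parser as literal data, e.g. `vroom(I(paste(cleaned, collapse = "\n")))`.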