Extraneous whitespace complicates parsing in the presence of quoted fields

Open jennybc opened this issue 2 years ago • 2 comments

Based on issues originally reported in readr: https://github.com/tidyverse/readr/issues/1237 and https://github.com/tidyverse/readr/issues/1350

readr 1e tolerates gratuitous whitespace when parsing quoted fields, which can be nice for human readability. But readr 2e, i.e. vroom, does not. Do we want to do anything about that?

library(readr)
library(vroom)

with_edition(1,
  read_csv('
     "X1",  "X2"
    "a,b", "c,d"',
  show_col_types = FALSE
  )
)
#> # A tibble: 1 × 2
#>   X1    X2   
#>   <chr> <chr>
#> 1 a,b   c,d

with_edition(2,
  read_csv('
     "X1",  "X2"
    "a,b", "c,d"',
  show_col_types = FALSE
  )
)
#> Warning: One or more parsing issues, see `problems()` for details
#> # A tibble: 1 × 2
#>   X1    X2            
#>   <chr> <chr>         
#> 1 a     "b\", \"c,d\""

vroom(I('
   "X1",  "X2"
  "a,b", "c,d"'),
  show_col_types = FALSE
)
#> Warning: One or more parsing issues, see `problems()` for details
#> # A tibble: 1 × 2
#>   X1    X2            
#>   <chr> <chr>         
#> 1 a     "b\", \"c,d\""

Created on 2022-01-26 by the reprex package (v2.0.1.9000)

jennybc, Jan 26 '22 19:01

I tried to find out if there are any standards that provide guidance on the right thing to do here.

Of course there is no real standard for CSV files, but this is the de facto one:

https://datatracker.ietf.org/doc/html/rfc4180

It seems to be rather silent on the matter of how whitespace and quoted fields interact.

This Library of Congress page is useful:

https://www.loc.gov/preservation/digital/formats/fdd/fdd000323.shtml

Several relatively common variations from the strict form specified by RFC 4180 are found and may be supported by software tools ....

The treatment of whitespace adjacent to field and record separators varies among applications. If whitespace at the beginning and end of a textual field value is significant, the text string should be text-qualified, i.e. enclosed in quotes.

Still not a definitive answer, but definitely compatible with the position that, if you've got "text-qualified" or quoted fields, you should not have extraneous whitespace in the file, such as adjacent to the delimiters.
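For a file like the one in the reprex, one possible workaround is to clean the text before parsing. The sketch below uses base R only. `strip_ws_around_quoted()` is a hypothetical helper, not part of readr or vroom. It assumes double quotes as the text qualifier with `""` as the escape, treats only spaces and tabs as whitespace, and works line by line, so it would mishandle quoted fields that contain embedded newlines.

```r
# Hypothetical helper (not part of readr or vroom): drop spaces and tabs that
# sit directly before an opening quote or directly after a closing quote.
strip_ws_around_quoted <- function(line) {
  # A quoted field: opening quote, then non-quotes or doubled quotes, closing quote.
  quoted <- '"(?:[^"]|"")*"'
  m <- gregexpr(quoted, line, perl = TRUE)
  fields <- regmatches(line, m)[[1]]
  n <- length(fields)
  if (n == 0) return(line)  # no quoted fields, leave the line untouched

  # The n + 1 pieces of text before, between and after the quoted fields.
  gaps <- regmatches(line, m, invert = TRUE)[[1]]
  gaps[1:n] <- sub("[ \t]+$", "", gaps[1:n])              # before an opening quote
  gaps[2:(n + 1)] <- sub("^[ \t]+", "", gaps[2:(n + 1)])  # after a closing quote

  # Interleave gap 1, field 1, gap 2, field 2, ..., then the final gap.
  paste0(c(rbind(gaps[1:n], fields), gaps[n + 1]), collapse = "")
}

raw <- c('     "X1",  "X2"', '    "a,b", "c,d"')
cleaned <- vapply(raw, strip_ws_around_quoted, character(1), USE.NAMES = FALSE)
writeLines(cleaned)
#> "X1","X2"
#> "a,b","c,d"
```

The cleaned lines could then be collapsed with `paste(cleaned, collapse = "\n")` and passed to `read_csv(I(...))` or `vroom(I(...))`. Whitespace inside unquoted fields and inside the quotes is left alone, which matches the Library of Congress guidance that significant whitespace should be text-qualified.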

jennybc, Jan 26 '22 23:01

Consider this file:

x ,y
"a,b", "c,d"

If I upload it to CSV Lint, I see:

Error Structural problem: Unexpected whitespace on row 2

"a,b", "c,d"

Quoted columns in the CSV should not have any leading or trailing whitespace. Remove any spaces, tabs or other whitespace from either side of the delimiters in the row.

http://csvlint.io/validation/61f1d9b415cdf0000400004f
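A rough local approximation of this check can be sketched in base R. This is not CSV Lint's actual implementation. `rows_with_ws_next_to_quotes()` is a hypothetical name, and the sketch assumes one record per line, double-quote qualifiers with `""` as the escape, and only spaces and tabs as whitespace.

```r
# Hypothetical check (an approximation, not CSV Lint's code): report the rows
# where a space or tab directly touches the outside of a quoted field.
rows_with_ws_next_to_quotes <- function(lines) {
  quoted <- '"(?:[^"]|"")*"'
  flagged <- vapply(lines, function(line) {
    m <- gregexpr(quoted, line, perl = TRUE)
    # Text before, between and after the quoted fields (n + 1 pieces).
    gaps <- regmatches(line, m, invert = TRUE)[[1]]
    n <- length(gaps) - 1
    if (n == 0) return(FALSE)  # no quoted fields on this row
    any(grepl("[ \t]$", gaps[1:n])) || any(grepl("^[ \t]", gaps[2:(n + 1)]))
  }, logical(1), USE.NAMES = FALSE)
  which(flagged)
}

rows <- rows_with_ws_next_to_quotes(c('x ,y', '"a,b", "c,d"'))
rows
#> [1] 2
```

Row 1 is not flagged because its fields are unquoted. Row 2 is flagged because of the space between the delimiter and `"c,d"`, which is consistent with the row CSV Lint reports.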

More evidence that would support leaving this as #wontfix.

jennybc, Jan 26 '22 23:01