
Improve handling of multiline strings

Open rmgpanw opened this issue 2 years ago • 2 comments

Hi, I was recently reading a table into R with the readr package and discovered that several rows were missing because some cells contained a single, unmatched double quotation mark.

I resolved this by setting quote = "". However, could read_tsv() and related functions at least raise a warning when improper quoting is detected in the data? The data.table package both raises a warning and handles this automatically, as the following reprex shows.

Many thanks for considering!

library(tibble)
library(readr)
library(data.table)

# create example file
df <- tribble(
  ~code, ~description,
  "1", '"Single quotes',
  "2", "No quotes",
  "3", 'Single quotes"',
  "4", '"Pair of quotes"',
  "5", "No quotes"
)

file_path <- tempfile(pattern = "df_with_quotes", fileext = ".tsv")

write_tsv(df,
  file_path,
  quote = "none",
  escape = "none"
)

# With default settings, `readr::read_tsv()` silently merges rows 1-3 into one
# multiline string and raises neither an error nor a warning
read_tsv(file_path,
  quote = "\""
)
#> Rows: 3 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): description
#> dbl (1): code
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 3 × 2
#>    code description                                    
#>   <dbl> <chr>                                          
#> 1     1 "Single quotes\n2\tNo quotes\n3\tSingle quotes"
#> 2     4 "Pair of quotes"                               
#> 3     5 "No quotes"

# To read the file correctly, disable quoting with `quote = ""`
read_tsv(file_path,
  quote = ""
)
#> Rows: 5 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): description
#> dbl (1): code
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5 × 2
#>    code description         
#>   <dbl> <chr>               
#> 1     1 "\"Single quotes"   
#> 2     2 "No quotes"         
#> 3     3 "Single quotes\""   
#> 4     4 "\"Pair of quotes\""
#> 5     5 "No quotes"

# `data.table::fread()` raises a warning but reads the table correctly with
# default settings
fread(file_path,
  quote = "\""
)
#> Warning in fread(file_path, quote = "\""): Found and resolved improper quoting
#> in first 100 rows. If the fields are not quoted (e.g. field separator does not
#> appear within any field), try quote="" to avoid this warning.
#>    code    description
#> 1:    1 "Single quotes
#> 2:    2      No quotes
#> 3:    3 Single quotes"
#> 4:    4 Pair of quotes
#> 5:    5      No quotes

# setting `quote=""` produces (almost) the same result, now without a warning
fread(file_path,
  quote = ""
)
#>    code      description
#> 1:    1   "Single quotes
#> 2:    2        No quotes
#> 3:    3   Single quotes"
#> 4:    4 "Pair of quotes"
#> 5:    5        No quotes

sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur/Monterey 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.14.6 readr_2.1.4       tibble_3.1.8     
#> 
#> loaded via a namespace (and not attached):
#>  [1] pillar_1.8.1      compiler_4.2.0    R.methodsS3_1.8.2 R.utils_2.12.2   
#>  [5] tools_4.2.0       digest_0.6.31     bit_4.0.5         evaluate_0.20    
#>  [9] lifecycle_1.0.3   R.cache_0.16.0    pkgconfig_2.0.3   rlang_1.0.6      
#> [13] reprex_2.0.2      cli_3.6.0         rstudioapi_0.14   parallel_4.2.0   
#> [17] yaml_2.3.7        xfun_0.37         fastmap_1.1.0     withr_2.5.0      
#> [21] styler_1.9.1      knitr_1.42        fs_1.6.1          vctrs_0.5.2      
#> [25] hms_1.1.2         tidyselect_1.2.0  bit64_4.0.5       glue_1.6.2       
#> [29] R6_2.5.1          fansi_1.0.4       vroom_1.6.1       rmarkdown_2.20   
#> [33] purrr_1.0.1       tzdb_0.3.0        magrittr_2.0.3    htmltools_0.5.4  
#> [37] ellipsis_0.3.2    utf8_1.2.3        crayon_1.5.2      R.oo_1.25.0

rmgpanw, Apr 17 '23 14:04

I ran into the same issue with read_csv(): a single double quote preceding one of the values caused the remaining records to go missing. Only the warning from fread() saved me:

[image]

daranzolin, Jul 19 '23 23:07

Here's a somewhat more minimal reprex:

library(readr)

lines <- 'code,description
1,"x
2,-
3,y"'
path <- tempfile()
writeLines(lines, path)

# Allows quoted string to span multiple lines
read_csv(path, col_types = list())
#> # A tibble: 1 × 2
#>    code description  
#>   <dbl> <chr>        
#> 1     1 "x\n2,-\n3,y"

# OK: disabling quoting with quote = "" works
read_csv(path, quote = "", col_types = list())
#> # A tibble: 3 × 2
#>    code description
#>   <dbl> <chr>      
#> 1     1 "\"x"      
#> 2     2 "-"        
#> 3     3 "y\""

# warns & keeps the stray quotes as literal characters
data.table::fread(path, quote = "\"")
#> Warning in data.table::fread(path, quote = "\""): Found and resolved improper
#> quoting in first 100 rows. If the fields are not quoted (e.g. field separator
#> does not appear within any field), try quote="" to avoid this warning.
#>    code description
#> 1:    1          "x
#> 2:    2           -
#> 3:    3          y"

Created on 2023-07-31 with reprex v2.0.2

hadley, Jul 31 '23 21:07
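
Until readr warns about this itself, one possible stopgap is to scan the file for unbalanced quotes before parsing it. The following is only a sketch using base R: `check_unbalanced_quotes()` is a hypothetical helper, not part of readr or data.table, and the "odd number of quotes on a line" rule is an assumed heuristic rather than what fread() does internally.

```r
# Hypothetical helper, not part of readr or data.table: report lines that
# contain an odd number of quote characters.
check_unbalanced_quotes <- function(path, quote = "\"") {
  lines <- readLines(path, warn = FALSE)
  # Count literal occurrences of the quote character on each line.
  n_quotes <- lengths(regmatches(lines, gregexpr(quote, lines, fixed = TRUE)))
  bad <- which(n_quotes %% 2 == 1)
  if (length(bad) > 0) {
    warning(
      length(bad), " line(s) contain an odd number of quote characters ",
      "(first at line ", bad[1], "); consider reading with quote = \"\".",
      call. = FALSE
    )
  }
  invisible(bad)
}

# Same contents as the minimal reprex above.
path <- tempfile(fileext = ".csv")
writeLines(c("code,description", "1,\"x", "2,-", "3,y\""), path)

bad <- check_unbalanced_quotes(path) # warns; `bad` holds file lines 2 and 4
```

Note that this check also flags files in which quoted fields legitimately span several lines, so it is best treated as a prompt to inspect the flagged lines rather than as proof that the quoting is wrong.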