Improve handling of multiline strings
Hi, I was recently reading a table into R using the readr package but discovered that several rows were missing as some cells included a single double quotation mark.
I managed to resolve this by setting quote = "", however I wonder if the read_tsv()/related functions could be updated to at least raise a warning if improper quoting is discovered in the data? The data.table package both raises a warning and deals with this automatically, as per the following reprex.
Many thanks for considering!
library(tibble)
library(readr)
library(data.table)
# create example file
df <- tribble(
~code, ~description,
"1", '"Single quotes',
"2", "No quotes",
"3", 'Single quotes"',
"4", '"Pair of quotes"',
"5", "No quotes"
)
file_path <- tempfile(pattern = "df_with_quotes", fileext = ".tsv")
write_tsv(df,
file_path,
quote = "none",
escape = "none"
)
# `readr::read_tsv()` reads inappropriately without raising an error with
# default settings
read_tsv(file_path,
quote = "\""
)
#> Rows: 3 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): description
#> dbl (1): code
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 3 × 2
#> code description
#> <dbl> <chr>
#> 1 1 "Single quotes\n2\tNo quotes\n3\tSingle quotes"
#> 2 4 "Pair of quotes"
#> 3 5 "No quotes"
# To read correctly use the `quote` argument
read_tsv(file_path,
quote = ""
)
#> Rows: 5 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): description
#> dbl (1): code
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5 × 2
#> code description
#> <dbl> <chr>
#> 1 1 "\"Single quotes"
#> 2 2 "No quotes"
#> 3 3 "Single quotes\""
#> 4 4 "\"Pair of quotes\""
#> 5 5 "No quotes"
# `data.table::fread()` raises a warning but reads the table correctly with
# default settings
fread(file_path,
quote = "\""
)
#> Warning in fread(file_path, quote = "\""): Found and resolved improper quoting
#> in first 100 rows. If the fields are not quoted (e.g. field separator does not
#> appear within any field), try quote="" to avoid this warning.
#> code description
#> 1: 1 "Single quotes
#> 2: 2 No quotes
#> 3: 3 Single quotes"
#> 4: 4 Pair of quotes
#> 5: 5 No quotes
# setting `quote=""` produces (almost) the same result, now without a warning
fread(file_path,
quote = ""
)
#> code description
#> 1: 1 "Single quotes
#> 2: 2 No quotes
#> 3: 3 Single quotes"
#> 4: 4 "Pair of quotes"
#> 5: 5 No quotes
sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur/Monterey 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] data.table_1.14.6 readr_2.1.4 tibble_3.1.8
#>
#> loaded via a namespace (and not attached):
#> [1] pillar_1.8.1 compiler_4.2.0 R.methodsS3_1.8.2 R.utils_2.12.2
#> [5] tools_4.2.0 digest_0.6.31 bit_4.0.5 evaluate_0.20
#> [9] lifecycle_1.0.3 R.cache_0.16.0 pkgconfig_2.0.3 rlang_1.0.6
#> [13] reprex_2.0.2 cli_3.6.0 rstudioapi_0.14 parallel_4.2.0
#> [17] yaml_2.3.7 xfun_0.37 fastmap_1.1.0 withr_2.5.0
#> [21] styler_1.9.1 knitr_1.42 fs_1.6.1 vctrs_0.5.2
#> [25] hms_1.1.2 tidyselect_1.2.0 bit64_4.0.5 glue_1.6.2
#> [29] R6_2.5.1 fansi_1.0.4 vroom_1.6.1 rmarkdown_2.20
#> [33] purrr_1.0.1 tzdb_0.3.0 magrittr_2.0.3 htmltools_0.5.4
#> [37] ellipsis_0.3.2 utf8_1.2.3 crayon_1.5.2 R.oo_1.25.0
I experienced the same issue with read_csv(). A single double quote preceding one of the values omitted the remaining records. Only the warning from fread() saved me:
Here's a somewhat more minimal reprex:
library(readr)
lines <- 'code,description
1,"x
2,-
3,y"'
path <- tempfile()
writeLines(lines, path)
# Allows quoted string to span multiple lines
read_csv(path, col_types = list())
#> # A tibble: 1 × 2
#> code description
#> <dbl> <chr>
#> 1 1 "x\n2,-\n3,y"
# OK: explicit quote works
read_csv(path, quote = "", col_types = list())
#> # A tibble: 3 × 2
#> code description
#> <dbl> <chr>
#> 1 1 "\"x"
#> 2 2 "-"
#> 3 3 "y\""
# warns & treats lines as quoted
data.table::fread(path, quote = "\"")
#> Warning in data.table::fread(path, quote = "\""): Found and resolved improper
#> quoting in first 100 rows. If the fields are not quoted (e.g. field separator
#> does not appear within any field), try quote="" to avoid this warning.
#> code description
#> 1: 1 "x
#> 2: 2 -
#> 3: 3 y"
Created on 2023-07-31 with reprex v2.0.2