Feature request: in `read_delim()` and associated functions, make `trim_ws = TRUE` also trim non-breaking spaces (U+00A0, code point 160)
If strings contain non-breaking spaces (Unicode code point 160, U+00A0; this is outside ASCII, which ends at 127), the argument trim_ws = TRUE in read_csv() (or read_delim()) does not trim them. This was unexpected to me, as str_trim() from stringr (part of the tidyverse) does.
Reprex:
library(tidyverse)
###### Example with regular spaces ######
# Create a vector with strings, spaces represented by ASCII code 32:
x <- c(intToUtf8(c(32, 65, 32, 119, 111, 114, 100)), # leading space
intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 32))) # trailing space
x
#> [1] " A word" "A sentence "
# Save as a csv
write_csv(data.frame(x), "reg_spaces")
# Read back in as csv
x2 <- read_csv("reg_spaces")
#> Rows: 2 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): x
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check: no leading or trailing spaces! :D
x2$x
#> [1] "A word" "A sentence"
# Check other functions:
trimws(x) # Works
#> [1] "A word" "A sentence"
str_trim(x) # Works
#> [1] "A word" "A sentence"
###### Example with non-breaking spaces ######
# Create a vector with strings, spaces represented by Unicode code point 160 (non-breaking spaces):
y <- c(intToUtf8(c(160, 65, 32, 119, 111, 114, 100)),
intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 160)))
y
#> [1] " A word" "A sentence "
# Write out as a csv and read back in:
write_csv(data.frame(y), "nonbreak_spaces")
y2 <- read_csv("nonbreak_spaces")
#> Rows: 2 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): y
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check: still has leading and trailing spaces... :(
y2$y
#> [1] " A word" "A sentence "
# Check other functions:
trimws(y) # Does not work
#> [1] " A word" "A sentence "
str_trim(y) # Works!
#> [1] "A word" "A sentence"
Real-life situation: I copied text from a website (a list of countries separated by commas) into an Excel spreadsheet saved as csv, then applied "Text to table" with a comma as the separator to place each country name in a separate column. I ignored the leading whitespace, assuming read_csv() would take care of it, but it did not. After some research, it appears that the csv kept the non-breaking spaces from the website (?), and read_csv() does not remove these.
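For anyone who wants to confirm that the stray characters really are non-breaking spaces, here is a minimal base R sketch. It rebuilds the `y` vector from the reprex and inspects code points; the variable name `nbsp` is just an illustrative choice.

```r
# Same strings as `y` in the reprex: a leading and a trailing U+00A0
y <- c(intToUtf8(c(160, 65, 32, 119, 111, 114, 100)),
       intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 160)))

# Code points of the first string: the 160 up front is the non-breaking space
utf8ToInt(y[1])
#> [1] 160  65  32 119 111 114 100

# Flag strings that start or end with U+00A0
nbsp <- intToUtf8(160)
startsWith(y, nbsp) | endsWith(y, nbsp)
#> [1] TRUE TRUE
```

A regular space would show up as 32 instead of 160, even though both print identically in the console.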
Looking at the underlying code, I think the parse.cpp code (in the function parse_vector_) could be adjusted to explicitly remove leading and trailing non-breaking spaces. Or it could be added to the header file Token.h: in lines 119 and 121, add \u00A0 as a whitespace character to remove, in addition to ' ' and '\t'.
readr has outsourced the reading and parsing to vroom, so the C++ function parse_vector_ is actually no longer used (except when type_convert() and friends are used, or explicitly calling the v1 engine).
It can be reproduced with vroom.
vroom::vroom(I(paste0("v1\tv2\n",
intToUtf8(c(160, 65, 32, 119, 111, 114, 100)),
"\tb\n1.0\t2.0\n")), trim_ws = TRUE)$v1
#> Rows: 2 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): v1, v2
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> [1] " A word" "1.0"
vroom::vroom(I(paste0("v1\tv2\n",
intToUtf8(c(32, 65, 32, 119, 111, 114, 100)),
"\tb\n1.0\t2.0\n")), trim_ws = TRUE)$v1
#> Rows: 2 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): v1, v2
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> [1] "A word" "1.0"
Created on 2025-04-29 with reprex v2.1.1
This is because the whitespace-trimming function does not detect \xa0:
https://github.com/tidyverse/vroom/blob/73c90c4fe490c0588b20ac527c40fcb1c683683e/src/utils.h#L228-L240
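One detail that may matter for a fix: in UTF-8, U+00A0 is not the single byte \xa0 but the two-byte sequence \xc2\xa0, so a trimmer that compares one byte at a time against a small set such as space and tab will never match it. The sketch below illustrates the idea in R. `trim_bytes()` is a hypothetical helper written for this comment; it is not vroom's code, and I am only assuming that the linked function works byte by byte.

```r
# U+00A0 occupies two bytes in UTF-8, a regular space only one
charToRaw(intToUtf8(160))
#> [1] c2 a0
charToRaw(" ")
#> [1] 20

# Hypothetical byte-level trimmer (illustration only, not vroom's implementation):
# strips space, tab, and the UTF-8 byte pair c2 a0 from both ends.
trim_bytes <- function(s) {
  b <- charToRaw(enc2utf8(s))
  ascii_ws <- as.raw(c(0x20, 0x09))
  start <- 1L
  end <- length(b)
  # Advance past leading whitespace
  repeat {
    if (start <= end && b[start] %in% ascii_ws) {
      start <- start + 1L
    } else if (start < end && b[start] == as.raw(0xc2) && b[start + 1L] == as.raw(0xa0)) {
      start <- start + 2L
    } else {
      break
    }
  }
  # Step back over trailing whitespace
  repeat {
    if (end >= start && b[end] %in% ascii_ws) {
      end <- end - 1L
    } else if (end > start && b[end] == as.raw(0xa0) && b[end - 1L] == as.raw(0xc2)) {
      end <- end - 2L
    } else {
      break
    }
  }
  if (start > end) return("")
  out <- rawToChar(b[start:end])
  Encoding(out) <- "UTF-8"
  out
}

trim_bytes(intToUtf8(c(160, 65, 32, 119, 111, 114, 100)))
#> [1] "A word"
```

The pair check is safe in valid UTF-8 because 0xc2 can only be a lead byte, never a continuation byte. Whether other Unicode whitespace should be handled too is a separate design question.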
Just wanted to add that the POSIX regex [:space:] does not include \xa0.
(text160 <- intToUtf8(c(160, 65, 32, 119, 111, 114, 100)))
#> [1] " A word"
(text32 <- intToUtf8(c(32, 65, 32, 119, 111, 114, 100)))
#> [1] " A word"
gsub("^[[:space:]]+", "", text160)
#> [1] " A word"
gsub("^[[:space:]]+", "", text32)
#> [1] "A word"
gsub("^[\\h\\s]+", "", text160, perl = TRUE)
#> [1] "A word"
gsub("^[\\h\\s]+", "", text32, perl = TRUE)
#> [1] "A word"
stringr::str_trim(text160, "left")
#> [1] "A word"
stringr::str_trim(text32, "left")
#> [1] "A word"
## under the hood
stringi::stri_trim_left(text160, pattern = "\\P{Wspace}", negate = FALSE)
#> [1] "A word"
stringi::stri_trim_left(text32, pattern = "\\P{Wspace}", negate = FALSE)
#> [1] "A word"
Created on 2025-04-29 with reprex v2.1.1
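Until the readers handle this themselves, a possible workaround is to trim after reading. The sketch below uses base R's `trimws()` with `whitespace = "[\\h\\v]"`, which its help page suggests for Unicode horizontal and vertical whitespace (needs R >= 3.6.0). `trim_unicode_ws()` is a hypothetical helper name, not part of readr or vroom, and the data frame stands in for whatever `read_csv()` returned.

```r
# Hypothetical helper: trim Unicode whitespace (including U+00A0)
# from every character column of a data frame
trim_unicode_ws <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], trimws, whitespace = "[\\h\\v]")
  df
}

# Stand-in for the result of read_csv(): leading and trailing U+00A0 in `y`
df <- data.frame(
  y = c(intToUtf8(c(160, 65, 32, 119, 111, 114, 100)),
        intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 160))),
  n = 1:2,
  stringsAsFactors = FALSE
)

trimmed <- trim_unicode_ws(df)
trimmed$y
```

Non-character columns are left untouched, and spaces inside a string (such as the one in "A word") are kept because the pattern is anchored to the ends.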
And this is how data.table handles it...
data.table::fread(I(paste0("v1\tv2\n",
intToUtf8(c(160, 65, 32, 119, 111, 114, 100)),
"\tb\n1.0\t2.0\n")), strip.white = TRUE)$v1
#> [1] " A word" "1.0"
data.table::fread(I(paste0("v1\tv2\n",
intToUtf8(c(32, 65, 32, 119, 111, 114, 100)),
"\tb\n1.0\t2.0\n")), strip.white = TRUE)$v1
#> [1] "A word" "1.0"
Created on 2025-04-29 with reprex v2.1.1
C Code: https://github.com/Rdatatable/data.table/blob/017fad886848a6a3e460d3f0be9acdfe609a9f0e/src/fread.c#L502