
Feature request: in `read_delim()` and associated functions, make `trim_ws = TRUE` also trim non-breaking spaces (U+00A0, code point 160)

Open gklarenberg opened this issue 11 months ago • 3 comments

If strings contain non-breaking spaces (U+00A0, code point 160, which lies outside the ASCII range), the argument trim_ws = TRUE in read_csv() (or read_delim()) does not remove them. This was unexpected to me, because str_trim() from stringr, also part of the tidyverse, does.

Reprex:

library(tidyverse)

###### Example with regular spaces ###### 
# Create a vector with strings, spaces represented by ASCII code 32:
x <- c(intToUtf8(c(32, 65, 32, 119, 111, 114, 100)), # leading space
       intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 32))) # trailing space
x
#> [1] " A word"     "A sentence "

# Save as a csv
write_csv(data.frame(x), "reg_spaces")
# Read back in as csv
x2 <- read_csv("reg_spaces")
#> Rows: 2 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): x
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check: no leading or trailing spaces! :D
x2$x
#> [1] "A word"     "A sentence"

# Check other functions:
trimws(x) # Works
#> [1] "A word"     "A sentence" 
str_trim(x) # Works
#> [1] "A word"     "A sentence" 

###### Example with non-breaking spaces ###### 
# Create a vector with strings, spaces represented by code point 160 (U+00A0, non-breaking spaces):
y <- c(intToUtf8(c(160, 65, 32, 119, 111, 114, 100)), 
       intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 160)))
y
#> [1] " A word"     "A sentence "

# Write out as a csv and read back in:
write_csv(data.frame(y), "nonbreak_spaces")
y2 <- read_csv("nonbreak_spaces")
#> Rows: 2 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): y
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check: still has leading and trailing spaces... :(
y2$y
#> [1] " A word"     "A sentence "

# Check other functions:
trimws(y) # Does not work
#> [1] " A word"     "A sentence " 
str_trim(y) # Works!
#> [1] "A word"     "A sentence" 

Real-life situation: I copied text from a website (a list of countries separated by commas) into an Excel (csv) spreadsheet and applied "Text to table", using a comma as the separator, to place each country name in a separate column. I ignored the leading white spaces, assuming read_csv() would take care of them, but it did not. After some research, it appears that the csv kept the non-breaking spaces from the website (?), and read_csv() does not remove these.
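A quick way to confirm that a stray leading or trailing character is a non-breaking space rather than a regular space is to look at its code point. This is a minimal sketch in base R that rebuilds the `y` vector from the reprex above; the variable name `nbsp` is only illustrative.

```r
y <- c(intToUtf8(c(160, 65, 32, 119, 111, 114, 100)),
       intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 160)))

# Code point of the first character of the first string:
# 160 is a non-breaking space, 32 would be a regular space
utf8ToInt(y[1])[1]
#> [1] 160

# Which elements contain a non-breaking space anywhere?
nbsp <- intToUtf8(160)
grepl(nbsp, y, fixed = TRUE)
#> [1] TRUE TRUE
```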

Looking at the underlying code, I think the parse.cpp code (in the function parse_vector_) could be adjusted to explicitly remove leading and trailing non-breaking spaces. Alternatively, \u00A0 could be added in the header file Token.h, in lines 119 and 121, as a white-space character to remove, in addition to ' ' and '\t'.

gklarenberg • Jan 09 '25 04:01

readr has outsourced the reading and parsing to vroom, so the C++ function parse_vector_ is actually no longer used (except when type_convert() and friends are used, or when the v1 engine is called explicitly).

It can be reproduced with vroom.

vroom::vroom(I(paste0("v1\tv2\n",
                      intToUtf8(c(160, 65, 32, 119, 111, 114, 100)),
                      "\tb\n1.0\t2.0\n")), trim_ws = TRUE)$v1
#> Rows: 2 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): v1, v2
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> [1] " A word" "1.0"

vroom::vroom(I(paste0("v1\tv2\n",
                      intToUtf8(c(32, 65, 32, 119, 111, 114, 100)),
                      "\tb\n1.0\t2.0\n")), trim_ws = TRUE)$v1
#> Rows: 2 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): v1, v2
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> [1] "A word" "1.0"

Created on 2025-04-29 with reprex v2.1.1

This happens because the white-space trimming function does not detect \xa0:

https://github.com/tidyverse/vroom/blob/73c90c4fe490c0588b20ac527c40fcb1c683683e/src/utils.h#L228-L240
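One likely reason a byte-wise trimmer misses this character is that U+00A0 is not a single byte in UTF-8. The sketch below only inspects the encoding from R. It assumes that the trimmer compares individual bytes against ' ' and '\t', as the Token.h lines mentioned in the original post suggest for readr. It does not reproduce vroom's actual code.

```r
nbsp <- intToUtf8(160)

# A regular space is the single byte 0x20
charToRaw(" ")
#> [1] 20

# U+00A0 is encoded as two bytes in UTF-8
charToRaw(nbsp)
#> [1] c2 a0

# Neither byte equals a space (0x20) or a tab (0x09)
charToRaw(nbsp) %in% charToRaw(" \t")
#> [1] FALSE FALSE
```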

chainsawriot • Apr 29 '25 15:04

Just wanted to add that the POSIX regex [:space:] does not include \xa0.

(text160 <- intToUtf8(c(160, 65, 32, 119, 111, 114, 100)))
#> [1] " A word"

(text32 <- intToUtf8(c(32, 65, 32, 119, 111, 114, 100)))
#> [1] " A word"

gsub("^[[:space:]]+", "", text160)
#> [1] " A word"

gsub("^[[:space:]]+", "", text32)
#> [1] "A word"

gsub("^[\\h\\s]+", "", text160, perl = TRUE)
#> [1] "A word"

gsub("^[\\h\\s]+", "", text32, perl = TRUE)
#> [1] "A word"

stringr::str_trim(text160, "left")
#> [1] "A word"

stringr::str_trim(text32, "left")
#> [1] "A word"

## under the hood

stringi::stri_trim_left(text160, pattern = "\\P{Wspace}", negate = FALSE)
#> [1] "A word"

stringi::stri_trim_left(text32, pattern = "\\P{Wspace}", negate = FALSE)
#> [1] "A word"

Created on 2025-04-29 with reprex v2.1.1

Reference:

https://perldoc.perl.org/perlrecharclass#Whitespace

chainsawriot • Apr 29 '25 16:04

And this is how data.table handles it...

data.table::fread(I(paste0("v1\tv2\n",
                      intToUtf8(c(160, 65, 32, 119, 111, 114, 100)),
                      "\tb\n1.0\t2.0\n")), strip.white = TRUE)$v1
#> [1] " A word" "1.0"

data.table::fread(I(paste0("v1\tv2\n",
                      intToUtf8(c(32, 65, 32, 119, 111, 114, 100)),
                      "\tb\n1.0\t2.0\n")), strip.white = TRUE)$v1
#> [1] "A word" "1.0"

Created on 2025-04-29 with reprex v2.1.1

C Code: https://github.com/Rdatatable/data.table/blob/017fad886848a6a3e460d3f0be9acdfe609a9f0e/src/fread.c#L502
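Since the readers shown in this thread all leave U+00A0 in place, one possible workaround is to clean character columns after reading. This is a minimal sketch in base R, not part of readr, vroom or data.table: the helper name `trim_ws_nbsp` is hypothetical, and it reuses the `[\\h\\s]` PCRE class from the regex comparison above.

```r
# Hypothetical helper: trim leading/trailing white space, including U+00A0,
# from every character column of a data frame. Other columns are untouched.
trim_ws_nbsp <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) {
      gsub("^[\\h\\s]+|[\\h\\s]+$", "", col, perl = TRUE)
    } else {
      col
    }
  })
  df
}

df <- data.frame(
  y = c(intToUtf8(c(160, 65, 32, 119, 111, 114, 100)),
        intToUtf8(c(65, 32, 115, 101, 110, 116, 101, 110, 99, 101, 160))),
  n = c(1, 2),
  stringsAsFactors = FALSE
)

trim_ws_nbsp(df)$y
#> [1] "A word"     "A sentence"
```

stringr::str_trim() could be used inside the helper instead of gsub(), since the thread shows it also removes U+00A0. The base R version only avoids the extra dependency.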

chainsawriot • Apr 29 '25 16:04