
Byte order marks

Open jennybc opened this issue 4 years ago • 3 comments

I recently got to spend some quality time with my best friend charToRaw(), courtesy of a byte order mark 😬

I was doing a round trip like so:

local plain text file --> upload to Google Drive & convert to a Google Doc --> export from Google Drive as text/plain --> read into memory in R --> parse back to character vector

While developing a test, I saw:

>   expect_setequal(
+     chicken_poem,
+     readLines(drive_example("chicken.txt"))
+   )
Error: `chicken_poem`[1] absent from readLines(drive_example("chicken.txt"))
> chicken_poem[[1]]
[1] "A chicken whose name was Chantecler"
> readLines(drive_example("chicken.txt"))[[1]]
[1] "A chicken whose name was Chantecler"
> chicken_poem[[1]] == readLines(drive_example("chicken.txt"))[[1]]
[1] FALSE
> Encoding(chicken_poem[[1]])
[1] "UTF-8"
> Encoding(readLines(drive_example("chicken.txt"))[[1]])
[1] "unknown"
> charToRaw(chicken_poem[[1]])
 [1] ef bb bf 41 20 63 68 69 63 6b 65 6e 20 77 68 6f 73 65 20 6e 61 6d 65 20 77 61
[27] 73 20 43 68 61 6e 74 65 63 6c 65 72
> charToRaw(readLines(drive_example("chicken.txt"))[[1]])
 [1] 41 20 63 68 69 63 6b 65 6e 20 77 68 6f 73 65 20 6e 61 6d 65 20 77 61 73 20 43
[27] 68 61 6e 74 65 63 6c 65 72

And thus I found the BOM on the text returning from the round trip.

Do you have anything to say about ... when you're likely to encounter BOMs? Should you get rid of them? If so, how? Or can you compare two strings in a way that ignores them?

jennybc, Jun 23 '21 21:06

Yeah, that is tricky. UTF-8 of course does not need a BOM because it is byte order independent.

Some tools, however, use \xef\xbb\xbf to mark a plain text file as UTF-8. E.g. Microsoft tools like to do that. Some of them also require it at the beginning of a text file.

What to do with it in R is a really tough question: R does not need it, and in fact it trips up common string functions:

❯ x <- paste0("\xef\xbb\xbfword ", "\u30de")
❯ Encoding(x)
[1] "UTF-8"

❯ x
[1] "word マ"

❯ nchar(x)
[1] 7

❯ substr(x, 1, 4)
[1] "wor"

❯ grepl("^word", x)
[1] FALSE

Why pasting strings with "unknown" and "UTF-8" encodings marks the result as "UTF-8", I am not sure. But the really weird part is that nchar(), substr() and grepl() all give surprising results, because they treat the BOM as part of the string.

So yes, ideally you would remove the BOM when manipulating the strings in R.
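
A minimal sketch of that removal at the string level, assuming the BOM shows up as a single leading U+FEFF character in a UTF-8 string. `strip_bom()` and `same_ignoring_bom()` are hypothetical helper names, not functions from base R or any package:

```r
# Hypothetical helper: drop one leading BOM (U+FEFF) from each element of x.
strip_bom <- function(x) {
  sub("^\ufeff", "", x)
}

# Hypothetical helper: element-wise comparison that ignores a leading BOM.
same_ignoring_bom <- function(a, b) {
  strip_bom(a) == strip_bom(b)
}

x <- paste0("\ufeff", "word ", "\u30de")
y <- strip_bom(x)

nchar(y)            # the BOM no longer counts as a character
grepl("^word", y)   # the anchor now matches at the visible start
same_ignoring_bom(x, "word \u30de")
```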

OTOH, if you are downloading a file from Google Drive that you would use in some (MS) tool later, then you'd want to keep it, otherwise that tool might not be able to read in the file.

I am not sure what the right solution is here. I am afraid that if you want to handle all use cases, then you'd need to make BOM handling explicit when downloading text files from Google Drive. E.g. have an option and/or function argument for it. Maybe the default of the option could be to remove it, and mark the string as UTF-8.
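
One possible shape for such an option, as a rough sketch only. `parse_text()`, its `keep_bom` argument, and the `mypkg.keep_bom` option name are all hypothetical and are not part of googledrive; the sketch assumes the downloaded content is available as a raw vector of UTF-8 bytes:

```r
# Hypothetical helper: turn downloaded UTF-8 bytes into a character vector of lines.
# keep_bom defaults to a (made-up) option, falling back to FALSE, i.e. remove the BOM.
parse_text <- function(bytes, keep_bom = getOption("mypkg.keep_bom", FALSE)) {
  stopifnot(is.raw(bytes))
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  if (!keep_bom && length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]          # drop the three BOM bytes
  }
  text <- rawToChar(bytes)
  Encoding(text) <- "UTF-8"         # mark the string as UTF-8
  strsplit(text, "\r?\n")[[1]]      # split on LF or CRLF line endings
}

# Simulated download: BOM followed by two lines of UTF-8 text.
bytes <- c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("word \u30de\nline two\n"))
out  <- parse_text(bytes)                   # BOM removed by default
kept <- parse_text(bytes, keep_bom = TRUE)  # BOM preserved on request
```

A caller who needs to hand the file to a tool that expects the BOM would pass `keep_bom = TRUE` (or set the option once), while everyone else gets strings that behave normally in R.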

gaborcsardi, Jul 02 '21 12:07

I think the suggestion for the R Encoding FAQ, then, is just to create awareness of the potential for these marks to exist.

When two strings look the same but clearly are not the same, charToRaw() is, as usual, your friend, and a BOM is one of the specific things to look for.
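
That check can be sketched in base R as follows. `has_bom()` is a hypothetical helper name; it only looks for the UTF-8 BOM bytes ef bb bf, not for UTF-16 or UTF-32 marks:

```r
# Hypothetical helper: TRUE for each element of x whose bytes start with ef bb bf.
has_bom <- function(x) {
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  vapply(x, function(s) {
    bytes <- charToRaw(s)
    length(bytes) >= 3 && identical(bytes[1:3], bom)
  }, logical(1), USE.NAMES = FALSE)
}

with_bom    <- "\ufeffA chicken whose name was Chantecler"
without_bom <- "A chicken whose name was Chantecler"

with_bom == without_bom            # the strings print alike but differ
has_bom(c(with_bom, without_bom))  # flags only the first one
```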

jennybc, Jul 02 '21 16:07

FWIW, readr / vroom have code to skip byte order marks at https://github.com/r-lib/vroom/blob/b3ba15212978253174c9f99f1098799cca9a6f74/src/utils.h#L215-L266, since they are pretty common in CSVs created using Microsoft programs.
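
In base R, a similar effect is available at read time: per `?file`, the encoding name "UTF-8-BOM" is accepted for reading and removes a BOM if one is present. A small self-contained sketch using a temporary file whose contents are made up for illustration:

```r
# Write a small file that starts with the UTF-8 BOM bytes.
path <- tempfile(fileext = ".txt")
writeBin(
  c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("A chicken whose name was Chantecler\n")),
  path
)

# A plain read keeps the BOM bytes at the start of the first line.
raw_lines <- readLines(path)

# Declaring the encoding as "UTF-8-BOM" drops a leading BOM, if any.
con <- file(path, encoding = "UTF-8-BOM")
clean_lines <- readLines(con)
close(con)

charToRaw(raw_lines[1])[1:3]
charToRaw(clean_lines[1])[1:3]
```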

jimhester, Oct 01 '21 15:10