Characters garbled from sink() on Windows
Some examples:
Sys.setlocale(, 'English') # can also try 'German_Austria'
# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
evaluate::evaluate("'\u0161'")
# [[1]]
# $src
# [1] "'š'"
#
# attr(,"class")
# [1] "source"
#
# [[2]]
# [1] "[1] \"\u009a\"\n"
Sys.setlocale(, 'Chinese')
# [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
evaluate::evaluate("'\u0161'")
# [[1]]
# $src
# [1] "'š'"
#
# attr(,"class")
# [1] "source"
#
# [[2]]
# [1] "[1] \"<U+0161>\"\n"
Originally reported at http://stackoverflow.com/q/34096239/559676
With only sink() and textConnection():
sink_test = function(locale = 'English') {
Sys.setlocale(, locale)
x = '\u0161'
y = character()
con = textConnection('y', local = TRUE, open = 'wr')
sink(con)
print(x)
sink()
y
}
sink_test()
# [1] "[1] \"歕""
In this reduced example, the only problem is that the result is marked with the wrong encoding:
z = sink_test()
Encoding(z)
# [1] "latin1"
iconv(z, to = 'UTF-8')
# [1] "[1] \"š\""
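One likely reading of the "latin1" mark here, and of the "\u009a" in the evaluate() output above: byte 0x9a is š in CP1252, but it is the control character U+009A in ISO-8859-1. A minimal sketch of that difference using only base R (the variable names are mine, and whether this is exactly what happens inside sink() is an assumption on my part):

```r
# The single byte that CP1252 uses for 'š' (U+0161)
s <- rawToChar(as.raw(0x9a))

# Decode the same byte under two different assumptions about its encoding
as_cp1252 <- iconv(s, from = "CP1252", to = "UTF-8")
as_latin1 <- iconv(s, from = "ISO-8859-1", to = "UTF-8")

charToRaw(as_cp1252)  # UTF-8 bytes of U+0161
# [1] c5 a1
charToRaw(as_latin1)  # UTF-8 bytes of U+009A
# [1] c2 9a
```

So the bytes themselves survive; they only need to be read with the right source encoding, which is what iconv(z, to = 'UTF-8') does when the native encoding is CP1252.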
I found this issue while investigating https://github.com/hadley/emo/issues/7.
Emoji still lose their characters with sink_test():
sink_test = function(locale = 'English') {
Sys.setlocale(, locale)
x = emo::ji('japanese_goblin')
y = character()
con = textConnection('y', local = TRUE, open = 'wr')
sink(con)
print(x)
sink()
y
}
sink_test()
#> [1] "<f0><U+009F><U+0091><U+00BA> "
Apparently we need a better sink(), one with an option like useBytes in writeLines(). But I see little hope...
output <- character(0L)
outputCon <- textConnection('output', 'wr')
writeLines(emo::ji('japanese_goblin'), outputCon, useBytes = TRUE)
close(outputCon)
output
#> [1] "村"
`Encoding<-`(output, 'UTF-8')
#> [1] "\xf0\u009f\u0091�"
cat(`Encoding<-`(output, 'UTF-8'))
#> 👺
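The same steps can be wrapped in a small helper. This is only a sketch: write_utf8() is a hypothetical name, not an existing function, and it assumes the text can be passed to writeLines() directly.

```r
# Hypothetical helper: write text to a textConnection as raw UTF-8 bytes,
# then re-mark the captured lines as UTF-8.
write_utf8 <- function(x) {
  out <- character()
  con <- textConnection("out", open = "w", local = TRUE)
  # useBytes = TRUE skips the translation to the native encoding
  writeLines(enc2utf8(x), con, useBytes = TRUE)
  close(con)
  # The connection drops the encoding mark, so restore it by hand
  Encoding(out) <- "UTF-8"
  out
}

res <- write_utf8("\u0161")
Encoding(res)
charToRaw(res)
```

This only covers text that goes through writeLines(). Output produced by print() under sink() is still translated to the native encoding first, which is the actual problem here.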
I think base R needs better support for UTF-8. I'm counting on @krlmlr to save the world: http://r.789695.n4.nabble.com/source-parse-and-foreign-UTF-8-characters-td4733523.html
Working on it with @dmurdoch ;-)
Oh, @krlmlr, you are always our UTF-8 hero! Cool. Thanks for the information 👍
Not sure but perhaps this is also related https://github.com/tidyverse/readr/issues/884
No, I'm quite sure it's not. In that case R does the right thing, but Boost doesn't :(
FWIW I filed a bug report with R and unfortunately it sounds like it will be too expensive for them to fix: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17503
Thanks @kevinushey! Then I wonder if it is possible to write a custom connection that supports UTF-8 instead of the native encoding. I have no idea about how connections in R work, but I remember Simon Urbanek gave a talk in 2013, in which he showed a custom connection based on 0MQ: https://github.com/s-u/zmqc
It seems that base R translates strings into the native encoding even before they reach the connection. Perhaps we really need a fix in base R for sink(), but I'm not sure.
Perhaps Windows will support UTF-8 as native encoding at some point. The "April 2018 insider build" of Windows seems to have some of it: https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
I see. If base R does the translation, I guess there is nothing we can do about it. That is really unfortunate...
Closing since recent R should handle this much better on Windows. If it's still a problem for folks, please let me know and we can try implementing something like r-lib/testthat#1693.
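For anyone checking whether their own session counts as "recent R": as far as I know, R 4.2.0 and later use UTF-8 as the native encoding on sufficiently recent Windows 10 builds. A quick sketch for checking the current session with base R's l10n_info() (the helper name is made up for illustration):

```r
# Hypothetical helper: TRUE if the session's native encoding is UTF-8
is_utf8_session <- function() {
  isTRUE(l10n_info()[["UTF-8"]])
}

is_utf8_session()
```

If this returns FALSE, the sink() examples above can presumably still garble non-native characters.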