evaluate Characters garbled from sink() on Windows

Some examples:

Sys.setlocale(, 'English')  # can also try 'German_Austria'
# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
evaluate::evaluate("'\u0161'")
# [[1]]
# $src
# [1] "'š'"
# 
# attr(,"class")
# [1] "source"
# 
# [[2]]
# [1] "[1] \"\u009a\"\n"

Sys.setlocale(, 'Chinese')
# [1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"
evaluate::evaluate("'\u0161'")
# [[1]]
# $src
# [1] "'š'"
# 
# attr(,"class")
# [1] "source"
# 
# [[2]]
# [1] "[1] \"<U+0161>\"\n"

Originally reported at http://stackoverflow.com/q/34096239/559676

With only sink() and textConnection():

sink_test = function(locale = 'English') {
  Sys.setlocale(, locale)
  x = '\u0161'
  y = character()
  con = textConnection('y', local = TRUE, open = 'wr')
  sink(con)
  print(x)
  sink()
  y  
}

sink_test()
# [1] "[1] \"\x9a\""

In this reduced example, the only problem is that the result is marked with the wrong encoding:

z = sink_test()
Encoding(z)
# [1] "latin1"

iconv(z, to = 'UTF-8')
# [1] "[1] \"š\""
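The mechanism behind the \u009a in the evaluate() output above can be reproduced without touching the locale. This is only a sketch: it assumes the platform's iconv knows the encoding names 'CP1252' and 'ISO-8859-1', and the variable names are made up for illustration.

```r
x = '\u0161'                                  # "š", stored as UTF-8

# What a CP1252 locale would write into the connection: the single byte 0x9a
b = iconv(x, from = 'UTF-8', to = 'CP1252')
byte = charToRaw(b)

# Reading that byte as ISO-8859-1 (what the "latin1" mark implies) yields the
# control character U+009A instead of U+0161
wrong = iconv(b, from = 'ISO-8859-1', to = 'UTF-8')

# Reading it as CP1252 recovers the original character
right = iconv(b, from = 'CP1252', to = 'UTF-8')

sprintf('U+%04X', utf8ToInt(wrong))
sprintf('U+%04X', utf8ToInt(right))
```

So the bytes themselves are fine; only the interpretation of the byte 0x9a differs.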

Dec 12 '15 05:12 yihui

I found this issue while investigating https://github.com/hadley/emo/issues/7.

Emojis are still garbled by sink_test():

sink_test = function(locale = 'English') {
  Sys.setlocale(, locale)
  x = emo::ji('japanese_goblin')
  y = character()
  con = textConnection('y', local = TRUE, open = 'wr')
  sink(con)
  print(x)
  sink()
  y  
}

sink_test()
#> [1] "<f0><U+009F><U+0091><U+00BA> "

Apparently, we need a better sink(), one with an option like useBytes in writeLines(). But I see little hope...

output <- character(0L)
outputCon <- textConnection('output', 'wr')
writeLines(emo::ji('japanese_goblin'), outputCon, useBytes = TRUE)
close(outputCon)
output
#> [1] "村"
`Encoding<-`(output, 'UTF-8')
#> [1] "\xf0\u009f\u0091�"
cat(`Encoding<-`(output, 'UTF-8'))
#> 👺
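For comparison, here is a minimal sketch of the same useBytes idea with a temporary file instead of a textConnection, so that the bytes can be read back and marked as UTF-8 in one step. The helper name roundtrip_utf8 is hypothetical, and this only covers strings written with writeLines(); it does not change what sink() does with print() output.

```r
# Hypothetical helper: write strings as raw UTF-8 bytes, then read them back
# with the UTF-8 mark, bypassing translation to the native encoding
roundtrip_utf8 = function(x) {
  tmp = tempfile()
  on.exit(unlink(tmp), add = TRUE)
  con = file(tmp, open = 'wb')
  # enc2utf8() makes sure the bytes are UTF-8; useBytes = TRUE writes them as-is
  writeLines(enc2utf8(x), con, useBytes = TRUE)
  close(con)
  # encoding = 'UTF-8' only marks the strings that are read; it does not re-encode
  readLines(tmp, encoding = 'UTF-8')
}

y = roundtrip_utf8('\u0161')
Encoding(y)
utf8ToInt(y) == 0x161
```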

May 13 '17 14:05 yutannihilation

I think base R needs better support for UTF-8. I'm counting on @krlmlr to save the world: http://r.789695.n4.nabble.com/source-parse-and-foreign-UTF-8-characters-td4733523.html

May 13 '17 16:05 yihui

Working on it with @dmurdoch ;-)

May 13 '17 16:05 krlmlr

Oh, @krlmlr, you are always our UTF-8 hero! Cool. Thanks for the information 👍

May 13 '17 20:05 yutannihilation

Not sure, but perhaps this is also related: https://github.com/tidyverse/readr/issues/884

Sep 19 '18 00:09 vnijs

No, I'm quite sure it's not. In that case, R does things right, but boost won't :(

Sep 19 '18 00:09 yutannihilation

FWIW I filed a bug report with R and unfortunately it sounds like it will be too expensive for them to fix: https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17503

Nov 15 '18 17:11 kevinushey

Thanks @kevinushey! Then I wonder if it is possible to write a custom connection that supports UTF-8 instead of the native encoding. I have no idea how connections in R work, but I remember Simon Urbanek gave a talk in 2013 in which he showed a custom connection based on 0MQ: https://github.com/s-u/zmqc

Nov 16 '18 16:11 yihui

It seems that base R translates strings into the native encoding even before they reach the connection. Perhaps we really do need a fix for sink() in base R, but I'm not sure.

Perhaps Windows will support UTF-8 as native encoding at some point. The "April 2018 insider build" of Windows seems to have some of it: https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8
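Whether a given session is affected can be checked from within R. A small sketch, assuming that a UTF-8 native encoding is what makes the translation step harmless; the variable name is made up:

```r
# l10n_info() reports properties of the current locale, including whether
# the native encoding is UTF-8
native_is_utf8 = isTRUE(l10n_info()[['UTF-8']])

if (native_is_utf8) {
  message('Native encoding is UTF-8; translation to native should be lossless.')
} else {
  message('Native encoding is not UTF-8; sink() output may be garbled.')
}
```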

Nov 16 '18 16:11 krlmlr

I see. If base R does the translation, I guess there is nothing we can do about it. That is really unfortunate...

Nov 16 '18 17:11 yihui

Closing since recent R should handle this much better on Windows. If it's still a problem for folks, please let me know and we can try implementing something like r-lib/testthat#1693.

Jun 14 '24 18:06 hadley