broken unicode uses <U+884C> which needs to be escaped
@takluyver said in https://github.com/IRkernel/IRkernel/pull/224#issuecomment-161316782:
I ran into another unicode issue while testing this. If R thinks it can't display a character, it escapes it like this: <U+884C> (vs Python style \u884c). These sequences are being included raw in the HTML that repr produces, so the browser tries to interpret them as HTML tags and doesn't show anything. repr should probably be escaping strings for the HTML representation.
please tell me what makes R output this.
probably a good idea to html-encode all character arrays before repr_htmling them, but still…
I came across it on Windows - e.g. by trying print("行政法") in a notebook. I would assume that R tries to determine what encoding the system uses, and if that encoding can't handle the code point in question, it escapes it to the <U+884C> format.
Ideally, our output should bypass that unicode escaping and just send the real unicode code points. But either way, strings in the HTML output need to be HTML escaped so that you can use <, > and & in strings and have them display correctly.
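A minimal sketch of the HTML escaping described above, using base R only. The name html_escape is hypothetical and this is not necessarily how repr implements it; it only illustrates why <U+884C> stops being read as a tag once <, > and & are replaced by entities.

```r
# Hypothetical helper (assumption): escape the three characters that matter
# for HTML text content. Not repr's actual implementation.
html_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)  # must run first, or it would re-escape the entities below
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x
}

# The escaped form is shown literally by the browser instead of being parsed as a tag.
html_escape("<U+884C>")
```

If the strings were ever placed inside HTML attributes, quotes would need escaping as well; that is left out here.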
See also https://github.com/IRkernel/repr/issues/28
Ok, this works in RStudio:
> print("行政法")
[1] "行政法"
but not in the notebook:
> print("行政法")
[1] "<U+884C><U+653F><U+6CD5>"

works for me. also we send encoding now. hmm. are you sure this happens with newest everything?
Still happening with the newest everything...
@flying-sheep You are on a non-windows system?
Found this blog post mentioning the problem, but haven't looked deep enough to understand what's going on... https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/
You are on a non-windows system?
Yeh
Yikes:
x = "A行政法ß"
nchar(x)
x
26
"A<U+884C><U+653F><U+6CD5>ß"
My interpretation is that the string is already wrong when it comes in?
Even clearer:
"法\u8FDB"
Produces this:
"<U+6CD5>进"
My current guess is that this is happening in evaluate -> see last element...
Input cell:
x = "法"
y = "\u8FDB"
nchar(x)
nchar(y)
x
y
print(x)
print(y)
Output in the notebook:
8
1
"<U+6CD5>"
"进"
[1] "<U+6CD5>"
[1] "<U+8FDB>"
Using IRkernel/IRkernel#293, this is what ends up in the log file:
2016-04-10 22:26:10 DEBUG: main loop: after poll
2016-04-10 22:26:10 DEBUG: main loop: shell
2016-04-10 22:26:10 DEBUG: Sending msg status
2016-04-10 22:26:10 DEBUG: Sending msg execute_input
2016-04-10 22:26:10 DEBUG: Executing code: x = "法"
y = "\u8FDB"
nchar(x)
nchar(y)
x
y
print(x)
print(y)
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
$ text/plain : chr "[1] 8"
$ text/html : chr "8"
$ text/markdown: chr "8"
$ text/latex : chr "8"
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
$ text/plain : chr "[1] 1"
$ text/html : chr "1"
$ text/markdown: chr "1"
$ text/latex : chr "1"
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
$ text/plain : chr "[1] \"<U+6CD5>\""
$ text/html : chr "\"<U+6CD5>\""
$ text/markdown: chr "\"<U+6CD5>\""
$ text/latex : chr "\"<U+6CD5>\""
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
$ text/plain : chr "[1] \"<U+8FDB>\""
$ text/html : chr "\"<U+8FDB>\"""| __truncated__
$ text/markdown: chr "\"<U+8FDB>\"""| __truncated__
$ text/latex : chr "\"<U+8FDB>\"""| __truncated__
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Stream output: [1] "<U+6CD5>"
2016-04-10 22:26:10 DEBUG: Sending msg stream
2016-04-10 22:26:10 DEBUG: Stream output: [1] "<U+8FDB>"
2016-04-10 22:26:10 DEBUG: Sending msg stream
2016-04-10 22:26:10 DEBUG: Sending msg status
2016-04-10 22:26:10 DEBUG: Sending msg execute_reply
2016-04-10 22:26:10 DEBUG: main loop: beginning
Guess:
8 # it's fine when it comes from zmq (see log), but it's already screwed up when it gets executed
1 # evaluate parses the unicode escape to a single value -> everything is fine
"<U+6CD5>" # ditto above
"进" # printing in the context of the kernel of a returned value is ok
[1] "<U+6CD5>" # no change...
[1] "<U+8FDB>" # but printing in evaluate will screw up the unicode again
So it looks like evaluate needs some encoding handling, both in and out?
https://stat.ethz.ch/R-manual/R-devel/library/base/html/source.html -> Encoding section
This is what I get on my Windows R:
> localeToCharset()
[1] "ISO8859-1"
And this is what I get on my NAS (linux based, hadleyverse docker image):
> localeToCharset()
[1] "UTF-8" "ISO8859-1"
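To compare systems more systematically, the relevant locale facts can be collected in one place. This is only a sketch: describe_encoding is a hypothetical helper name, while l10n_info(), localeToCharset() and Sys.getlocale() are standard R functions.

```r
# Hypothetical helper (assumption): summarise what R believes about the
# session's character encoding.
describe_encoding <- function() {
  list(
    utf8_locale = isTRUE(l10n_info()[["UTF-8"]]),  # TRUE if the session runs in a UTF-8 locale
    charset     = utils::localeToCharset(),        # same call as in the transcripts above
    lc_ctype    = Sys.getlocale("LC_CTYPE")        # the locale category that governs character handling
  )
}

str(describe_encoding())
```

On the Windows machine above one would expect utf8_locale to be FALSE, and TRUE on the Linux NAS, but the output depends entirely on the system it runs on.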
And here is an example of the evaluate problem (both executed in an RStudio window...):
library(evaluate)
code <- "
x = '法'
y = '\\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
"
l = list()
# Append each chunk of text output to the global list `l`.
txt <- function(o, type) {
  t <- paste(o, collapse = '\n')
  l[length(l)+1] <<- t
}
# Output handler: record text output, pass everything else through unchanged.
oh <- new_output_handler(source = identity,
                         text = function(o) txt(o, "text"),
                         graphics = identity,
                         message = identity,
                         warning = identity,
                         error = identity,
                         value = identity)
x <- evaluate(code, output_handler = oh)
l
Windows:
> Encoding(code)
[1] "UTF-8"
> parse(text=code)
expression(x = '<U+6CD5>', y = '\u8FDB', print(nchar(x)), print(nchar(y)),
print(x), print(y))
> l
[[1]]
[1] "[1] 8\n" #> bad in
[[2]]
[1] "[1] 1\n" # ok if escaped...
[[3]]
[1] "[1] \"<U+6CD5>\"\n" # -> Just the bad in
[[4]]
[1] "[1] \"<U+8FDB>\"\n" # -> but here it's bad out...
Linux (NAS):
> Encoding(code)
[1] "UTF-8"
> parse(text=code)
expression(x = '法', y = '\u8FDB', print(nchar(x)), print(nchar(y)),
print(x), print(y))
> l
[[1]]
[1] "[1] 1\n"
[[2]]
[1] "[1] 1\n"
[[3]]
[1] "[1] \"法\"\n"
[[4]]
[1] "[1] \"进\"\n"
And even further down for the input problem:
Windows:
> parse(text='"法 \\u8FDB"')
expression("<U+6CD5> \u8FDB")
Linux:
> parse(text='"法 \\u8FDB"')
expression("法 \u8FDB")
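The parse() difference above can be turned into a small self-check. This is a sketch under assumptions: parse_keeps_unicode is a hypothetical name, and the result depends on the platform and locale, so no particular output is claimed here.

```r
# Hypothetical helper (assumption): does a non-ASCII string literal survive
# parse() as a single character on this system?
parse_keeps_unicode <- function() {
  code <- "'\u6CD5'"                        # source text holding one CJK character in quotes
  value <- eval(parse(text = code)[[1]])    # parse the literal, then evaluate it back to a string
  nchar(value) == 1L                        # 1 if intact; 8 if it became "<U+6CD5>"
}

parse_keeps_unicode()
```

Going by the transcripts above, this would be FALSE on the affected Windows setup and TRUE on the Linux one.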
If someone wants to have fun: c sources of parse: https://github.com/wch/r-source/blob/e5b21d0397c607883ff25cca379687b86933d730/src/main/source.c#L193
I tried to set my locale, but everything I tried was rejected by Sys.setlocale(...).
thanks for digging into this. i think you were almost there. parse has an encoding argument.
i filed hadley/evaluate#66.
depending on how it is resolved (automatic/manually) we might need to extract and specify the encoding when calling a fixed/enhanced version of evaluate or not.
I tried that argument and it didn't make any difference :-(
I updated hadley/evaluate#66 with code examples which demonstrate what goes wrong here...
Current status here: it's an upstream bug and we have some workarounds (warn on unicode input and don't send the ellipsis char on such systems). So not a blocker for the next release IMO -> restore the milestone if you have a different opinion...
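The "warn on unicode input" workaround could look roughly like the following. This is only a sketch and not the actual IRkernel code: warn_if_unicode_unsafe is a hypothetical name, and the check (non-ASCII bytes in the code plus a non-UTF-8 session) is an assumption about what such a warning would test.

```r
# Hypothetical helper (assumption): warn when the code contains non-ASCII
# characters but the R session is not running in a UTF-8 locale.
warn_if_unicode_unsafe <- function(code) {
  utf8_session <- isTRUE(l10n_info()[["UTF-8"]])
  # Any byte above 127 means the string is not pure ASCII.
  non_ascii <- any(vapply(
    code,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1)
  ))
  risky <- non_ascii && !utf8_session
  if (risky) {
    warning("Input contains non-ASCII characters, but this R session is not ",
            "in a UTF-8 locale; they may be turned into <U+XXXX> escapes.")
  }
  invisible(risky)
}

warn_if_unicode_unsafe("x <- 1")  # pure ASCII input never triggers the warning
```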
But HTML output is now being escaped, right? So you can at least see <U+884C>?
Yes and no: yes because html is escaped and no, because of https://github.com/IRkernel/repr/pull/43 I see three dots (=3 chars).
But "OUT" is not the problem: you always see something, it's just escaped in the funny <U+xxxx> and therefore not C&P-able... "IN" is the bigger problem, but that was taken care of in https://github.com/IRkernel/IRkernel/pull/296
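For the copy-and-paste problem, the escapes can be undone after the fact. The sketch below is a post-hoc repair and not a fix for the underlying encoding issue; unescape_u is a hypothetical name, and it cannot tell a genuine literal "<U+884C>" in the data apart from one that R produced by escaping.

```r
# Hypothetical helper (assumption): turn "<U+XXXX>" escapes back into the
# characters they stand for.
unescape_u <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,8}>", x)          # locate every <U+XXXX> escape
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    hex <- substr(esc, 4, nchar(esc) - 1)              # drop the leading "<U+" and trailing ">"
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })
  x
}

nchar(unescape_u("<U+884C><U+653F><U+6CD5>"))  # three characters instead of 24
```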
Since R 4.2, R has UTF-8 support on Windows. Is there anything one needs to do there, or will it just work?