broken unicode uses <U+884C> which needs to be escaped

Open flying-sheep opened this issue 9 years ago • 21 comments

@takluyver said in https://github.com/IRkernel/IRkernel/pull/224#issuecomment-161316782:

I ran into another unicode issue while testing this. If R thinks it can't display a character, it escapes it like this: <U+884C> (vs Python style \u884c). These sequences are being included raw in the HTML that repr produces, so the browser tries to interpret them as HTML tags and doesn't show anything. repr should probably be escaping strings for the HTML representation.

please tell me what makes R output this.

probably a good idea to html-encode all character arrays before repr_htmling them, but still…

flying-sheep avatar Dec 04 '15 15:12 flying-sheep

I came across it on Windows - e.g. by trying print("行政法") in a notebook. I would assume that R tries to determine what encoding the system uses, and if that encoding can't handle the code point in question, it escapes it to the <U+884C> format.

Ideally, our output should bypass that unicode escaping and just send the real unicode code points. But either way, strings in the HTML output need to be HTML escaped so that you can use <, > and & in strings and have them display correctly.
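
A minimal sketch of that escaping step in base R. `html_escape` is a hypothetical helper name used only for illustration, not repr's actual function, and it handles only the three characters named above.

```r
# Hypothetical helper (not part of repr): escape the characters that HTML
# treats specially. "&" is replaced first so that the entities produced
# for "<" and ">" are not escaped a second time.
html_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  gsub(">", "&gt;", x, fixed = TRUE)
}

html_escape("<U+884C>")
html_escape("a < b & c > d")
```

With this applied, a browser shows the literal text <U+884C> instead of treating it as an unknown tag.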

takluyver avatar Dec 04 '15 16:12 takluyver

See also https://github.com/IRkernel/repr/issues/28

jankatins avatar Apr 07 '16 14:04 jankatins

Ok, this works in RStudio:

> print("行政法")
[1] "行政法"

but not in the notebook:

> print("行政法")
[1] "<U+884C><U+653F><U+6CD5>"

jankatins avatar Apr 08 '16 16:04 jankatins

works for me. also we send encoding now. hmm. are you sure this happens with newest everything?

flying-sheep avatar Apr 09 '16 14:04 flying-sheep

Still happening with the newest everything...

@flying-sheep You are on a non-windows system?

jankatins avatar Apr 10 '16 10:04 jankatins

Found this blog post mentioning the problem, but haven't looked deep enough to understand what's going on... https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/

jankatins avatar Apr 10 '16 10:04 jankatins

You are on a non-windows system?

Yeh

flying-sheep avatar Apr 10 '16 15:04 flying-sheep

Yikes:

x = "A行政法ß"
nchar(x)
x
26
"A<U+884C><U+653F><U+6CD5>ß"

My interpretation is that the string is already wrong when it comes in?
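
The length of 26 is consistent with that reading: each of the three CJK characters has become an 8-character escape. The sketch below reproduces the count and shows one possible way to turn such escapes back into real characters after the fact. `unescape_uplus` is a hypothetical helper, not something the kernel or repr provides, and it would also rewrite text that merely looks like an escape.

```r
# "A" + three 8-character escapes + "ß" = 1 + 3 * 8 + 1 = 26
escaped <- "A<U+884C><U+653F><U+6CD5>\u00df"
nchar(escaped)

# Hypothetical helper: replace each <U+XXXX> escape with the code point
# it names, leaving all other text untouched.
unescape_uplus <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,8}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    hex <- substr(tokens, 4L, nchar(tokens) - 1L)  # strip "<U+" and ">"
    vapply(strtoi(hex, 16L), intToUtf8, character(1))
  })
  x
}

restored <- unescape_uplus(escaped)
nchar(restored)
```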

jankatins avatar Apr 10 '16 17:04 jankatins

Even clearer:

"法\u8FDB"

Produces this:

"<U+6CD5>进"

jankatins avatar Apr 10 '16 18:04 jankatins

My current guess is that this is happening in evaluate -> see last element...

Input cell:

x = "法"
y = "\u8FDB"
nchar(x)
nchar(y)
x
y
print(x)
print(y)

Output in the notebook:

8
1
"<U+6CD5>"
"进"
[1] "<U+6CD5>"
[1] "<U+8FDB>"

Using IRkernel/IRkernel#293, this is what ends up in the log file:

2016-04-10 22:26:10 DEBUG: main loop: after poll
2016-04-10 22:26:10 DEBUG: main loop: shell
2016-04-10 22:26:10 DEBUG: Sending msg status
2016-04-10 22:26:10 DEBUG: Sending msg execute_input
2016-04-10 22:26:10 DEBUG: Executing code: x = "法"
y = "\u8FDB"
nchar(x)
nchar(y)
x
y
print(x)
print(y)
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] 8"
 $ text/html    : chr "8"
 $ text/markdown: chr "8"
 $ text/latex   : chr "8"
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] 1"
 $ text/html    : chr "1"
 $ text/markdown: chr "1"
 $ text/latex   : chr "1"
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] \"<U+6CD5>\""
 $ text/html    : chr "\"&lt;U+6CD5&gt;\""
 $ text/markdown: chr "\"&lt;U+6CD5&gt;\""
 $ text/latex   : chr "\"<U+6CD5>\""
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Sending display_data: List of 4
 $ text/plain   : chr "[1] \"<U+8FDB>\""
 $ text/html    : chr "\"<U+8FDB>\"""| __truncated__
 $ text/markdown: chr "\"<U+8FDB>\"""| __truncated__
 $ text/latex   : chr "\"<U+8FDB>\"""| __truncated__
2016-04-10 22:26:10 DEBUG: Sending msg display_data
2016-04-10 22:26:10 DEBUG: Stream output: [1] "<U+6CD5>"

2016-04-10 22:26:10 DEBUG: Sending msg stream
2016-04-10 22:26:10 DEBUG: Stream output: [1] "<U+8FDB>"

2016-04-10 22:26:10 DEBUG: Sending msg stream
2016-04-10 22:26:10 DEBUG: Sending msg status
2016-04-10 22:26:10 DEBUG: Sending msg execute_reply
2016-04-10 22:26:10 DEBUG: main loop: beginning

Guess:

8 # it's fine when it comes from zmq (see log), but it's already screwed up when it gets executed
1 # evaluate parses the unicode escape to a single value -> everything is fine
"<U+6CD5>" # ditto, see the first line
"进" # printing in the context of the kernel of a returned value is ok
[1] "<U+6CD5>" # no change...
[1] "<U+8FDB>" # but printing in evaluate will screw up the unicode again

So it looks like evaluate needs some encoding handling, both in and out?
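
The "bad out" half can be probed without evaluate. The sketch below assumes that `capture.output()`, which collects printed text through a text connection, goes through the same native-encoding conversion as evaluate's output capture. That equivalence is an assumption and is not confirmed anywhere in this thread.

```r
y <- "\u8FDB"

# The value itself is a single UTF-8 character regardless of platform.
nchar(y)
Encoding(y)

# Printed output that is captured as text has to pass through the
# session's native encoding. In a UTF-8 locale the character is expected
# to survive; where the locale cannot represent it, R substitutes <U+8FDB>.
captured <- capture.output(print(y))
captured
grepl("<U+", captured, fixed = TRUE)  # TRUE would indicate the escaping
```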

jankatins avatar Apr 10 '16 20:04 jankatins

https://stat.ethz.ch/R-manual/R-devel/library/base/html/source.html -> Encoding section

This is what I get on my Windows R:

> localeToCharset()
[1] "ISO8859-1"

And this is what I get on my NAS (Linux-based, hadleyverse docker image):

> localeToCharset()
[1] "UTF-8"     "ISO8859-1"
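
A few more base R calls report the same thing from different angles, and a round trip through `enc2native()` shows whether a particular string survives the native encoding. `representable` is a hypothetical helper name; the sketch assumes that `enc2native()` substitutes <U+XXXX> escapes for characters the locale cannot hold, which matches the behaviour seen above.

```r
# What R believes about the current session's encoding.
Sys.getlocale("LC_CTYPE")
localeToCharset()
l10n_info()[["UTF-8"]]   # TRUE when the native encoding is UTF-8

# Hypothetical helper: does the string survive a round trip through the
# native encoding? If enc2native() had to substitute escapes, the
# round-tripped string no longer equals the original.
representable <- function(x) {
  u <- enc2utf8(x)
  enc2utf8(enc2native(u)) == u
}

representable("abc")
representable("\u884c\u653f\u6cd5")
```

On the Linux box above the second call should give TRUE, and on the ISO8859-1 Windows session FALSE.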

jankatins avatar Apr 10 '16 20:04 jankatins

And here is an example of the evaluate problem (both executed in an RStudio window...):

library(evaluate)

code <- "
x = '法'
y = '\\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
"

l = list()
txt <- function(o, type) {
  t <- paste(o, collapse = '\n')
  l[length(l)+1] <<- t
}
oh <- new_output_handler(source = identity, 
                         text = function(o) txt(o, "text"), 
                         graphics = identity,
                         message = identity, 
                         warning = identity, 
                         error = identity, 
                         value = identity)

x <- evaluate(code, output_handler = oh)
l

Windows:

> Encoding(code)
[1] "UTF-8"
> parse(text=code)
expression(x = '<U+6CD5>', y = '\u8FDB', print(nchar(x)), print(nchar(y)), 
    print(x), print(y))
> l
[[1]]
[1] "[1] 8\n" #> bad in

[[2]]
[1] "[1] 1\n" # ok if escaped...

[[3]]
[1] "[1] \"<U+6CD5>\"\n" # -> Just the bad in

[[4]]
[1] "[1] \"<U+8FDB>\"\n" # -> but here it's bad out...

Linux (NAS):

> Encoding(code)
[1] "UTF-8"
> parse(text=code)
expression(x = '法', y = '\u8FDB', print(nchar(x)), print(nchar(y)), 
    print(x), print(y))
> l
[[1]]
[1] "[1] 1\n"

[[2]]
[1] "[1] 1\n"

[[3]]
[1] "[1] \"法\"\n"

[[4]]
[1] "[1] \"进\"\n"

jankatins avatar Apr 10 '16 20:04 jankatins

And even further down for the input problem:

Windows:

> parse(text='"法 \\u8FDB"')
expression("<U+6CD5> \u8FDB")

Linux:

> parse(text='"法 \\u8FDB"')
expression("法 \u8FDB")
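
Since the `\u8FDB` escape survives `parse()` on both systems while the literal character does not, one conceivable workaround is to rewrite non-ASCII characters as escapes before parsing. This is only a sketch: `escape_non_ascii` is a hypothetical helper, and because `\u` escapes are only valid inside string literals it would break code that uses non-ASCII characters in identifiers and would alter comments.

```r
# Hypothetical helper: rewrite every non-ASCII character of a single
# string as a \uXXXX (or \UXXXXXXXX) escape.
escape_non_ascii <- function(code) {
  cps <- utf8ToInt(enc2utf8(code))           # code points of the string
  chars <- intToUtf8(cps, multiple = TRUE)   # one element per character
  esc <- ifelse(cps > 0xFFFF,
                sprintf("\\U%08X", cps),
                sprintf("\\u%04X", cps))
  paste0(ifelse(cps > 127L, esc, chars), collapse = "")
}

code <- "x <- '\u6cd5 \u8FDB'"
ascii_code <- escape_non_ascii(code)
ascii_code

# The escaped source is pure ASCII, so the parser no longer depends on
# the native encoding to read it.
env <- new.env()
eval(parse(text = ascii_code), envir = env)
nchar(env$x)
```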

jankatins avatar Apr 10 '16 21:04 jankatins

If someone wants to have fun: c sources of parse: https://github.com/wch/r-source/blob/e5b21d0397c607883ff25cca379687b86933d730/src/main/source.c#L193

I tried to set my locale, but everything I tried was rejected by Sys.setlocale(...).

jankatins avatar Apr 10 '16 22:04 jankatins

thanks for digging into this. i think you were almost there. parse has an encoding argument.

i filed hadley/evaluate#66.

depending on how it is resolved (automatically/manually) we might need to extract and specify the encoding when calling a fixed/enhanced version of evaluate or not.
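
for reference, a minimal sketch of calling `parse()` with that argument. according to the R documentation, `encoding` only marks the parsed strings as Latin-1 or UTF-8 and does not re-encode the input, so whether it avoids the <U+6CD5> escaping on a non-UTF-8 Windows session is exactly the open question.

```r
code <- enc2utf8("x <- '\u6cd5'")
Encoding(code)

# `encoding` is a documented argument of parse(); it declares how strings
# in the input should be marked, it does not convert the input.
expr <- parse(text = code, encoding = "UTF-8")

# Evaluate in a throwaway environment and look at the resulting string.
value <- eval(expr[[1]], envir = new.env())
nchar(value)
Encoding(value)
```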

flying-sheep avatar Apr 11 '16 08:04 flying-sheep

I tried that argument and it didn't make any difference :-(

jankatins avatar Apr 11 '16 09:04 jankatins

I updated hadley/evaluate#66 with code examples which demonstrate what goes wrong here...

jankatins avatar Apr 11 '16 11:04 jankatins

Current status here: it's an upstream bug and we have some workarounds (warn on unicode input and don't send the ellipsis char on such systems). So not a blocker for the next release IMO -> restore the milestone if you have a different opinion...
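
A rough sketch of what a "warn on unicode input" check could look like. This is not the code from the IRkernel pull request: `warn_if_non_ascii` and the warning text are made up for illustration, and the check assumes the input arrives as a single UTF-8 string.

```r
# Hypothetical check: warn when the input contains non-ASCII characters
# and the session's native encoding is not UTF-8.
warn_if_non_ascii <- function(code) {
  non_ascii <- any(utf8ToInt(enc2utf8(code)) > 127L)
  if (non_ascii && !isTRUE(l10n_info()[["UTF-8"]])) {
    warning("Input contains non-ASCII characters; they may be turned into ",
            "<U+XXXX> escapes in this session. Consider \\uXXXX escapes.",
            call. = FALSE)
  }
  invisible(non_ascii)
}

warn_if_non_ascii("x <- 1")
warn_if_non_ascii("x <- '\u6cd5'")
```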

jankatins avatar Apr 21 '16 15:04 jankatins

But HTML output is now being escaped, right? So you can at least see <U+884C>?

takluyver avatar Apr 21 '16 16:04 takluyver

But HTML output is now being escaped, right? So you can at least see <U+884C>?

Yes and no: yes because html is escaped and no, because of https://github.com/IRkernel/repr/pull/43 I see three dots (=3 chars).

But "OUT" is not the problem: you always see something, it's just escaped in the funny <U+xxxx> format and therefore can't be copied and pasted... "IN" is the bigger problem, but that was taken care of in https://github.com/IRkernel/IRkernel/pull/296

jankatins avatar Apr 21 '16 17:04 jankatins

Since version 4.2, R supports UTF-8 as the native encoding on Windows. Is there anything one needs to do there, or will it just work?
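
A quick way to see which situation a given session is in. This is a sketch using base R only; as far as I know, UTF-8 only becomes the native encoding on sufficiently recent Windows versions, so the check looks at the session itself rather than at the R version alone.

```r
on_windows  <- .Platform$OS.type == "windows"
new_enough  <- getRversion() >= "4.2.0"
native_utf8 <- isTRUE(l10n_info()[["UTF-8"]])

if (on_windows && !native_utf8) {
  message("Native encoding is not UTF-8; <U+XXXX> escapes are still possible.")
}

# If the native encoding is UTF-8, this comes back as the real character
# rather than as a <U+884C> escape.
enc2native("\u884c")
```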

flying-sheep avatar Jun 27 '22 08:06 flying-sheep