Issue handling Unicode characters
RJSONIO and rjson behave differently on the following input. A short snippet of code that demonstrates the problem:
> library(RJSONIO)
> nchar(fromJSON("{ \"S\": \"L\\u00e9vis\"\n }"))
Error in nchar(fromJSON("{ \"S\": \"L\\u00e9vis\"\n }")) :
invalid multibyte string 1
rjson parses the same string without an error, while RJSONIO returns a raw \xe9 byte:
> library(rjson)
> fromJSON("{ \"S\": \"L\\u00e9vis\"\n }")
$S
[1] "Lévis"
> detach("package:rjson", unload=TRUE)
> library(RJSONIO)
> fromJSON("{ \"S\": \"L\\u00e9vis\"\n }")
S
"L\xe9vis"
After some testing, I found that forcing the encoding to 'latin1' seems to do the trick, which is odd, since my locale is UTF-8:
> Sys.getlocale()
[1] "pt_BR.UTF-8/pt_BR.UTF-8/pt_BR.UTF-8/C/pt_BR.UTF-8/pt_BR.UTF-8"
And this also struck me as odd:
> library(jsonlite)
> s = fromJSON("{ \"S\": \"L\\u00e9vis\"\n }")
> Encoding(s$S)
[1] "unknown"
> iconv(s$S, from="latin1", to="")
[1] "Lévis"
It's as though the string had been encoded as latin1 for some reason, but was not properly marked as such.
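The behavior above can be reproduced without any JSON package. The following is a minimal sketch in base R, assuming the parser hands back the single latin1 byte 0xE9 with no encoding mark (an assumption inferred from the printed "L\xe9vis", not confirmed from the RJSONIO or libjson source):

```r
# Build the latin1 bytes for "Lévis" by hand (0xE9 is 'é' in latin1).
s <- rawToChar(as.raw(c(0x4c, 0xe9, 0x76, 0x69, 0x73)))

Encoding(s)    # "unknown": R treats the bytes as being in the native encoding
validUTF8(s)   # FALSE: a lone 0xE9 byte is not valid UTF-8

# Option 1: declare the existing bytes as latin1 (the bytes are unchanged).
marked <- s
Encoding(marked) <- "latin1"

# Option 2: convert the bytes to UTF-8 and mark the result accordingly.
converted <- iconv(s, from = "latin1", to = "UTF-8")
Encoding(converted)   # "UTF-8"
nchar(converted)      # 5
```

In a UTF-8 locale, an unmarked string is interpreted as UTF-8, which would explain why `nchar()` complains about an invalid multibyte string until the encoding is declared or converted.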
I reported this same issue to the jsonlite package as well (https://github.com/jeroenooms/jsonlite/issues/5), since it and RJSONIO share some code and they both have the same issue.
I've been investigating this, but I can't figure out whether the problem is in libjson or in the RJSONIO code. It does look like a latin1 string is incorrectly marked as UTF-8 somewhere in the process. But if you mix actual Unicode and escaped Unicode:
fromJSON('["L\\u00e9vis-éé"]')
The unescaped Unicode is fine. libjson has a flag called JSON_UNICODE, but it cannot be used in combination with JSON_ISO_STRICT.
@duncantl is there any reason we need JSON_ISO_STRICT? Could we get it to work with JSON_UNICODE instead?
@duncantl this looks like a bug where the character vector is not properly initialized. In particular, the error given by nchar might give a hint as to what's going wrong:
> library(RJSONIO)
> x <- fromJSON('["Z\\u00FCrich"]')
> print(x)
[1] "Z\xfcrich"
> nchar(x)
Error in nchar(x) : invalid multibyte string 1
> #This fixes it:
> Encoding(x) <- "latin1"
> print(x)
[1] "Zürich"
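Until the parser is fixed, the same idea can be applied to a whole parsed result. The sketch below uses a hypothetical helper, `fix_latin1` (not part of RJSONIO or jsonlite), that walks a nested list and converts only the strings that are not valid UTF-8, on the assumption that those strings are really latin1. The input is simulated with raw bytes so the example runs without RJSONIO installed:

```r
# Hypothetical helper: re-encode strings that are not valid UTF-8,
# assuming their bytes are latin1. Lists are processed recursively.
fix_latin1 <- function(x) {
  if (is.character(x)) {
    bad <- !is.na(x) & !validUTF8(x)
    x[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")
    x
  } else if (is.list(x)) {
    x[] <- lapply(x, fix_latin1)  # x[] keeps names and other attributes
    x
  } else {
    x
  }
}

# Simulated parser output: "Zürich" with ü as the single latin1 byte 0xFC.
zurich <- rawToChar(as.raw(c(0x5a, 0xfc, 0x72, 0x69, 0x63, 0x68)))
parsed <- list(S = zurich, nested = list(c(a = zurich, b = "ok")), n = 1)

fixed <- fix_latin1(parsed)
nchar(fixed$S)      # 6
Encoding(fixed$S)   # "UTF-8"
```

This is only a workaround. A string that mixes an escaped character with literal UTF-8 text, such as the `"L\\u00e9vis-éé"` case above, would contain both latin1 and UTF-8 bytes, and converting it from latin1 would mangle the part that was already correct.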
Apparently @jeroenooms was able to fix this in jsonlite: https://github.com/jeroenooms/jsonlite/issues/5#issuecomment-50172522