
Issue handling Unicode characters

Open asieira opened this issue 12 years ago • 4 comments

RJSONIO behaves differently from rjson when a JSON string contains an escaped Unicode character. A short snippet that demonstrates the problem:

> library(RJSONIO)
> nchar(fromJSON("{ \"S\": \"L\\u00e9vis\"\n        }"))
Error in nchar(fromJSON("{ \"S\": \"L\\u00e9vis\"\n        }")) : 
  invalid multibyte string 1

rjson parses the same string without error, whereas RJSONIO returns a string containing a raw \xe9 byte:

> library(rjson)
> fromJSON("{ \"S\": \"L\\u00e9vis\"\n        }")
$S
[1] "Lévis"
> detach("package:rjson", unload=TRUE)
> library(RJSONIO)
> fromJSON("{ \"S\": \"L\\u00e9vis\"\n        }")
         S 
"L\xe9vis" 

After some testing, I found that forcing the encoding to 'latin1' seems to do the trick, which is odd, since my locale is UTF-8:

> Sys.getlocale()
[1] "pt_BR.UTF-8/pt_BR.UTF-8/pt_BR.UTF-8/C/pt_BR.UTF-8/pt_BR.UTF-8"

And this also struck me as odd:

> library(jsonlite)
> s = fromJSON("{ \"S\": \"L\\u00e9vis\"\n        }")
> Encoding(s$S)
[1] "unknown"
> iconv(s$S, from="latin1", to="")
[1] "Lévis"

It's as though the string had been encoded as latin1 for some reason, but was not properly marked as such.
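To make that hypothesis concrete, here is a minimal sketch that does not call RJSONIO at all. It hand-builds the bytes that the output above suggests (an assumption on my part: "L", a single 0xE9 byte, then "vis") and shows that such a string is unmarked, is not valid UTF-8, and becomes usable once it is either declared or converted as latin1. The variable names are only illustrative.

```r
# Assumed byte sequence: "L", 0xE9 (latin1 "é"), "v", "i", "s".
raw_bytes <- as.raw(c(0x4c, 0xe9, 0x76, 0x69, 0x73))
s <- rawToChar(raw_bytes)

Encoding(s)    # "unknown": the string carries no encoding mark
validUTF8(s)   # FALSE: 0xE9 followed by "v" is not a valid UTF-8 sequence

# Option 1: keep the bytes and only declare them as latin1.
marked <- s
Encoding(marked) <- "latin1"
nchar(marked)  # 5

# Option 2: re-encode the bytes as UTF-8 (0xE9 becomes 0xC3 0xA9).
converted <- iconv(s, from = "latin1", to = "UTF-8")
charToRaw(converted)
```

Option 1 leaves the bytes untouched, so it only helps if every non-ASCII byte in the string really is latin1. Option 2 produces a proper UTF-8 string, which is what one would expect from a JSON parser in a UTF-8 locale.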

I reported this same issue to the jsonlite package as well (https://github.com/jeroenooms/jsonlite/issues/5), since it and RJSONIO share some code and they both have the same issue.

asieira avatar Dec 27 '13 12:12 asieira

I've been investigating this, but I can't figure out whether the problem is in libjson or in the RJSONIO code. It does indeed look like a latin1 string is incorrectly marked as UTF-8 somewhere in the process. But if you mix literal Unicode and escaped Unicode:

fromJSON('["L\\u00e9vis-éé"]')

The unescaped Unicode is fine. There is a flag in libjson called JSON_UNICODE, but it cannot be used in combination with JSON_ISO_STRICT.

jeroen avatar Jan 18 '14 19:01 jeroen

@duncantl is there any reason we need JSON_ISO_STRICT? Could we get it to work with JSON_UNICODE instead?

jeroen avatar Jan 18 '14 21:01 jeroen

@duncantl this looks like a bug where the character vector is not properly initialized. In particular, the error given by nchar might give a hint as to what's going wrong:

> library(RJSONIO)
> x <- fromJSON('["Z\\u00FCrich"]')
> print(x)
[1] "Z\xfcrich"

> nchar(x)
Error in nchar(x) : invalid multibyte string 1

> #This fixes it:
> Encoding(x) <- "latin1"
> print(x)
[1] "Zürich"
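Until the parser itself is fixed, the same idea could be applied to a whole parsed result. The following is only a sketch of a possible workaround: `fix_latin1_escapes` is a hypothetical helper, not part of RJSONIO, and it assumes that any string that fails `validUTF8()` consists of latin1 bytes. The input is simulated with `rawToChar()` rather than produced by `fromJSON()`.

```r
# Hypothetical helper: re-encode strings that are not valid UTF-8,
# assuming (!) that their bytes are latin1.
fix_latin1_escapes <- function(x) {
  repair <- function(s) {
    bad <- !is.na(s) & !validUTF8(s)
    s[bad] <- iconv(s[bad], from = "latin1", to = "UTF-8")
    s
  }
  if (is.character(x)) return(repair(x))
  if (is.list(x)) {
    # Walk nested lists and repair only the character leaves.
    return(rapply(x, repair, classes = "character", how = "replace"))
  }
  x
}

# Simulated broken value: "Z", 0xFC (latin1 "ü"), "rich".
broken <- c(S = rawToChar(as.raw(c(0x5a, 0xfc, 0x72, 0x69, 0x63, 0x68))))
fixed <- fix_latin1_escapes(broken)
nchar(fixed)   # 6

# Nested results are handled as well; non-character leaves are left alone.
nested <- fix_latin1_escapes(list(a = broken, b = 1))
```

Note that this would not help with a string that mixes escaped and literal non-ASCII characters, as in the 'L\\u00e9vis-éé' example earlier in the thread: the whole string fails the UTF-8 check, and converting it from latin1 would corrupt the characters that were already correct UTF-8.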

jeroen avatar Jul 07 '14 19:07 jeroen

Apparently @jeroenooms was able to fix this in jsonlite: https://github.com/jeroenooms/jsonlite/issues/5#issuecomment-50172522

asieira avatar Jul 25 '14 16:07 asieira