RedditExtractor
Unicode handling has been broken
Describe the bug
RedditExtractoR returns broken comment body if it contains non-ASCII Unicode.
To Reproduce
$ R_LIBS_USER=lib R
> devtools::load_all()
ℹ Loading RedditExtractoR
> options(HTTPUserAgent = 'API Test (by /u/nmtake)')
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "ö\017f\n&Ø_\u008c"
Expected behavior
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "零式艦上戦闘機二一型 Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"
Desktop (please complete the following information):
- Linux fedora 6.1.7-200.fc37.x86_64
- RedditExtractoR commit ecb9a86e
- R 4.2.2
- RJSONIO 1.3-1.8
Additional context
Here are the details. I tried to get this comment, which contains non-ASCII characters:
零式艦上戦闘機二一型 Type zero carrier fighter model 21
https://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero
but RedditExtractoR returns broken comment:
$ R_LIBS_USER=lib R
> devtools::load_all()
ℹ Loading RedditExtractoR
> options(HTTPUserAgent = 'API Test (by /u/nmtake)')
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "ö\017f\n&Ø_\u008c"
[...]
This happens because Reddit's JSON escapes non-ASCII characters,
$ curl -A 'API Test (by /u/nmtake)' 'https://old.reddit.com/r/translator/comments/10wr3xg/.json' > japanese.json
$ cat japanese.json
[...]
"body": "\u96f6\u5f0f\u8266\u4e0a\u6226\u95d8\u6a5f\u4e8c\u4e00\u578b Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero",
and RJSONIO doesn't seem to be able to handle such Unicode escapes:
> ret = RJSONIO::fromJSON('["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"]', asText = TRUE)
> ret
[1] "\xf6\017f\n&\xd8_\x8c"
> iconv(ret, 'latin1', 'UTF-8') # reproduce the original broken text
[1] "ö\017f\n&Ø_\u008c"
Please note that \\u4e00 and all the characters after it are dropped. I suspect RJSONIO treats the 00 as ASCII NUL (the C string terminator).
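A small sketch that is consistent with this suspicion (it is only a model of the observed output, not a reading of RJSONIO's source): if only the low byte of each \uXXXX escape is kept and the result is cut at the first 0x00 byte, we get exactly the eight bytes shown above. The variable names are mine, for illustration only.

```r
# Code points of the first nine characters of the comment body,
# copied from the \uXXXX escapes in Reddit's JSON.
code_points <- c(0x96f6, 0x5f0f, 0x8266, 0x4e0a, 0x6226,
                 0x95d8, 0x6a5f, 0x4e8c, 0x4e00)

# ASSUMPTION: each escape is truncated to its low byte.
low_bytes <- code_points %% 256

# ASSUMPTION: the result is handled as a C string, so it ends at the
# first 0x00 byte (here, the low byte of U+4E00).
nul_at <- which(low_bytes == 0)[1]
kept <- as.raw(low_bytes[seq_len(nul_at - 1)])

# The bytes RJSONIO actually returned: "\xf6\017f\n&\xd8_\x8c"
observed <- as.raw(c(0xf6, 0x0f, 0x66, 0x0a, 0x26, 0xd8, 0x5f, 0x8c))

identical(kept, observed)
```

If the comparison holds, the garbage is not random: it is the low byte of each code unit, and the text stops where the first low byte happens to be zero.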
FYI, with jsonlite::fromJSON(simplifyVector = FALSE), we can get the correct text:
> jsonlite::fromJSON('["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero carrier fighter model 21\\n\\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"]', simplifyVector = FALSE)
[1] "零式艦上戦闘機二一型 Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"
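For anyone who wants to check a parser without relying on how the terminal renders Japanese text, here is a minimal sketch that compares the decoded code points directly. It assumes jsonlite is installed; `parse_json_text` is a hypothetical helper of mine, not a RedditExtractoR function, and the JSON string is a shortened version of the comment body above.

```r
library(jsonlite)

# Hypothetical helper: parse JSON text with jsonlite instead of RJSONIO.
parse_json_text <- function(txt) {
  jsonlite::fromJSON(txt, simplifyVector = FALSE)
}

# Shortened sample: the ten escaped characters plus a little ASCII text.
json_txt <- '["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero"]'
comment_body <- parse_json_text(json_txt)[[1]]

# Code points we expect once the \uXXXX escapes are decoded.
expected <- c(0x96f6, 0x5f0f, 0x8266, 0x4e0a, 0x6226,
              0x95d8, 0x6a5f, 0x4e8c, 0x4e00, 0x578b)

# TRUE when every escape was decoded to the right character,
# including U+4E00 and everything after it.
decoded_ok <- all(utf8ToInt(comment_body)[seq_along(expected)] == expected)
decoded_ok
```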
@nmtake thanks for reporting this!
I'm afraid that I'll be doing a lot of travelling in the near future and I may not have the time to look into this soon, so this issue will have to wait. Until then, if you have a solution in mind, then you may consider creating a PR yourself. I imagine the fix will have to sit somewhere around here.
A quick update: I had a look at this earlier and attempted to fix it, but it seems to be a tricky problem. Migrating from RJSONIO to jsonlite or httr might help, but it would introduce some substantial breaking changes, which I'm not ready to work on at present.