RedditExtractor Unicode handling has been broken

Describe the bug

RedditExtractoR returns a broken comment body if the comment contains non-ASCII Unicode characters.

To Reproduce

$ R_LIBS_USER=lib R
> devtools::load_all()
ℹ Loading RedditExtractoR
> options(HTTPUserAgent = 'API Test (by /u/nmtake)')
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "ö\017f\n&Ø_\u008c"

Expected behavior

> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "零式艦上戦闘機二一型 Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"

Desktop (please complete the following information):

Linux fedora 6.1.7-200.fc37.x86_64
RedditExtractoR commit ecb9a86e
R 4.2.2
RJSONIO 1.3-1.8

Additional context

Here are the details. I tried to fetch this comment, which contains non-ASCII characters:

零式艦上戦闘機二一型 Type zero carrier fighter model 21

https://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero

but RedditExtractoR returns a broken comment:

$ R_LIBS_USER=lib R
> devtools::load_all()
ℹ Loading RedditExtractoR
> options(HTTPUserAgent = 'API Test (by /u/nmtake)')
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "ö\017f\n&Ø_\u008c"
[...]

This is because Reddit's JSON escapes non-ASCII characters,

$ curl -A 'API Test (by /u/nmtake)' 'https://old.reddit.com/r/translator/comments/10wr3xg/.json' > japanese.json
$ cat japanese.json
[...]
"body": "\u96f6\u5f0f\u8266\u4e0a\u6226\u95d8\u6a5f\u4e8c\u4e00\u578b Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero",

and RJSONIO doesn't seem to be able to handle such Unicode escapes:

> ret = RJSONIO::fromJSON('["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"]', asText = TRUE)
> ret
[1] "\xf6\017f\n&\xd8_\x8c"
> iconv(ret, 'latin1', 'UTF-8')  # reproduce the original broken text
[1] "ö\017f\n&Ø_\u008c"

Please note that everything from \\u4e00 onward is dropped. I suspect RJSONIO treats the 00 byte as ASCII NUL (the C string terminator).
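
To make that suspicion concrete, here is a minimal base-R sketch. It rests on two assumptions that are guesses about RJSONIO's behavior, not something confirmed from its source: each \uXXXX escape is reduced to its low byte, and the resulting C string ends at the first zero byte. The variable names are only illustrative.

```r
# Code units taken from the escaped "body" value in Reddit's JSON, in order.
code_units <- c(0x96f6, 0x5f0f, 0x8266, 0x4e0a, 0x6226,
                0x95d8, 0x6a5f, 0x4e8c, 0x4e00, 0x578b)

# Assumption: only the low byte of each code unit survives.
low_bytes <- as.integer(code_units %% 256)

# Assumption: the string is cut at the first zero byte (C string terminator).
first_nul <- match(0L, low_bytes)
predicted <- low_bytes[seq_len(first_nul - 1L)]

# Bytes of the string RJSONIO actually returned, copied from the output above.
observed <- as.integer(charToRaw("\xf6\017f\n&\xd8_\x8c"))

# TRUE means the two assumptions reproduce the broken output byte for byte.
print(identical(predicted, observed))
```

This prints TRUE: the low bytes f6 0f 66 0a 26 d8 5f 8c are exactly the bytes in the broken string, and the zero byte contributed by \\u4e00 sits at position 9, where the output stops.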

FYI, with jsonlite::fromJSON(simplifyVector = FALSE), we can get the correct text:

> jsonlite::fromJSON('["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero carrier fighter model 21\\n\\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"]', simplifyVector = FALSE)
[1] "零式艦上戦闘機二一型 Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"
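
Because the console rendering of CJK text depends on the locale, comparing code points is a more robust check. The sketch below assumes jsonlite is installed and uses only the CJK prefix of the comment body. The variable names are illustrative.

```r
# Only the CJK prefix of the comment body, as escaped in Reddit's JSON.
json <- '["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b"]'

# simplifyVector = FALSE keeps the JSON array as an R list.
parsed <- jsonlite::fromJSON(json, simplifyVector = FALSE)
body <- parsed[[1]]

# Code points we expect, in the same order as the escapes above.
expected <- as.integer(c(0x96f6, 0x5f0f, 0x8266, 0x4e0a, 0x6226,
                         0x95d8, 0x6a5f, 0x4e8c, 0x4e00, 0x578b))

# utf8ToInt() returns the Unicode code points of a UTF-8 string.
print(identical(utf8ToInt(body), expected))
```

All ten code points should come back, including U+4E00, which is where the RJSONIO result stops.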

Feb 08 '23 23:02 nmtake

@nmtake thanks for reporting this!

I'm afraid that I'll be doing a lot of travelling in the near future and I may not have the time to look into this soon, so this issue will have to wait. Until then, if you have a solution in mind, then you may consider creating a PR yourself. I imagine the fix will have to sit somewhere around here.

Feb 09 '23 04:02 ivan-rivera

A quick update: I had a look at this earlier and attempted to fix it, but it seems to be a tricky problem. Migrating from RJSONIO to jsonlite or httr might help, but it would introduce some substantial breaking changes, which I'm not ready to work on at present.

Mar 16 '23 09:03 ivan-rivera
