
Unicode handling has been broken

Open nmtake opened this issue 2 years ago • 2 comments

Describe the bug

RedditExtractoR returns a broken comment body if the comment contains non-ASCII characters.

To Reproduce

$ R_LIBS_USER=lib R
> devtools::load_all()
ℹ Loading RedditExtractoR
> options(HTTPUserAgent = 'API Test (by /u/nmtake)')
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "ö\017f\n&Ø_\u008c"

Expected behavior

> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "零式艦上戦闘機二一型 Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"

Desktop (please complete the following information):

  • Linux fedora 6.1.7-200.fc37.x86_64
  • RedditExtractoR commit ecb9a86e
  • R 4.2.2
  • RJSONIO 1.3-1.8

Additional context

Here are the details. I tried to get this comment, which contains non-ASCII characters:

零式艦上戦闘機二一型 Type zero carrier fighter model 21

https://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero

but RedditExtractoR returns a broken comment:

$ R_LIBS_USER=lib R
> devtools::load_all()
ℹ Loading RedditExtractoR
> options(HTTPUserAgent = 'API Test (by /u/nmtake)')
> thread = get_thread_content('https://old.reddit.com/r/translator/comments/10wr3xg/')
> thread$comments$comment
[1] "ö\017f\n&Ø_\u008c"
[...]

It's because Reddit's JSON escapes non-ASCII characters,

$ curl -A 'API Test (by /u/nmtake)' 'https://old.reddit.com/r/translator/comments/10wr3xg/.json' > japanese.json
$ cat japanese.json
[...]
"body": "\u96f6\u5f0f\u8266\u4e0a\u6226\u95d8\u6a5f\u4e8c\u4e00\u578b Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero",

and RJSONIO doesn't seem to be able to handle such Unicode escapes:

> ret = RJSONIO::fromJSON('["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"]', asText = TRUE)
> ret
[1] "\xf6\017f\n&\xd8_\x8c"
> iconv(ret, 'latin1', 'UTF-8')  # reproduce the original broken text
[1] "ö\017f\n&Ø_\u008c"

Please note that everything from \\u4e00 onward is dropped. I suspect RJSONIO treats the 00 as an ASCII NUL (the C string terminator).
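
The bytes RJSONIO returned appear consistent with this suspicion. As a rough check, the base-R sketch below keeps only the low byte of each code point from the escaped body and stops at the first zero byte, then compares the result with the bytes in the broken output. The low-byte truncation is an assumption inferred from the output above, not something confirmed from RJSONIO's source, and the variable names are only illustrative.

```r
# Code points copied from the \uXXXX escapes in the "body" field above.
code_points <- c(0x96f6, 0x5f0f, 0x8266, 0x4e0a, 0x6226,
                 0x95d8, 0x6a5f, 0x4e8c, 0x4e00, 0x578b)

# Correct decoding: each escape is one full Unicode code point.
expected <- intToUtf8(code_points)

# Hypothesized faulty decoding (an assumption): keep only the low byte.
low_bytes <- code_points %% 256

# A C string ends at the first zero byte; here it is the low byte of \u4e00.
first_nul <- which(low_bytes == 0)[1]
truncated <- as.raw(low_bytes[seq_len(first_nul - 1)])

# Bytes of the string RJSONIO returned in the transcript above.
reported <- charToRaw("\xf6\017f\n&\xd8_\x8c")

print(expected)
print(truncated)
print(identical(truncated, reported))
```

If the two byte sequences match, that would explain both the garbled leading characters and why the text is cut off right before \\u4e00.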


FYI, with jsonlite::fromJSON(simplifyVector = FALSE), we can get the correct text:

> jsonlite::fromJSON('["\\u96f6\\u5f0f\\u8266\\u4e0a\\u6226\\u95d8\\u6a5f\\u4e8c\\u4e00\\u578b Type zero carrier fighter model 21\\n\\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"]', simplifyVector = FALSE)
[1] "零式艦上戦闘機二一型 Type zero carrier fighter model 21\n\nhttps://en.wikipedia.org/wiki/Mitsubishi_A6M_Zero"

nmtake, Feb 08 '23 23:02

@nmtake thanks for reporting this!

I'm afraid that I'll be doing a lot of travelling in the near future and I may not have the time to look into this soon, so this issue will have to wait. Until then, if you have a solution in mind, then you may consider creating a PR yourself. I imagine the fix will have to sit somewhere around here.

ivan-rivera, Feb 09 '23 04:02

A quick update: I had a look at this earlier and attempted to fix it, but it seems to be a tricky problem. Migrating from RJSONIO to jsonlite or httr might help, but it would introduce some substantial breaking changes, which I'm not ready to work on at present.

ivan-rivera, Mar 16 '23 09:03