[BUG] Emoji parsing not working from Facebook json file

Open Gusman10000 opened this issue 5 years ago • 7 comments

Describe the bug Emojis aren't imported properly when importing a FB .json message file; they appear as odd Unicode symbols instead.

To Reproduce

  1. Import a Facebook .json file and view the results. For me, nothing more needed to be done.

Expected behavior .json file imported completely with all symbols being properly identified

Screenshots I've never sent this odd symbol (2nd down) in Messenger in my life. "It'd" has also been converted weirdly here too: Capture

Desktop:

  • OS: Windows 10
  • Browser Firefox
  • Version [e.g. 22]

Additional context Yesterday I was writing a parser in Python to convert these .json files into a WhatsApp text file, and I ran into this exact problem. Initially the code would convert the first byte of an emoji and ignore the rest.

In Python I found the fix for this would be:

def fixup_string(text):
    return text.encode('latin1').decode('utf8')

I'm not well versed in JS, so I'm not sure what the translation would be. The screenshot below shows a simple Python example of the issue, using content from a message I pulled from my .json, along with the solution:

Capture2
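Since the screenshot is not reproduced here, the following is a minimal sketch of the same idea in text form. It assumes the export stores each UTF-8 byte as a separate `\u00XX` escape, which is what the symptoms in this report suggest. The helper names `fixup_string_safe` and `fixup_json` are hypothetical and not part of the original parser.

```python
def fixup_string(text):
    # Re-read each character as one byte, then decode those bytes as UTF-8.
    return text.encode('latin1').decode('utf8')


def fixup_string_safe(text):
    # Hypothetical wrapper: leave the text alone if it is not mojibake
    # (for example, it already contains real emoji or valid accented letters).
    try:
        return fixup_string(text)
    except UnicodeError:
        return text


def fixup_json(value):
    # Hypothetical helper: walk a parsed JSON value and repair every string value.
    if isinstance(value, str):
        return fixup_string_safe(value)
    if isinstance(value, list):
        return [fixup_json(item) for item in value]
    if isinstance(value, dict):
        return {key: fixup_json(item) for key, item in value.items()}
    return value


# "It'd" with a curly apostrophe, as it appears after a naive import.
broken = "It\u00e2\u0080\u0099d"
repaired = fixup_string(broken)  # "It\u2019d"
```

The safe wrapper matters when a file mixes damaged and undamaged strings, because `encode('latin1')` raises on any character above U+00FF.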

Gusman10000 avatar Jan 22 '20 00:01 Gusman10000

Interesting. I'll need a bit of time to get my own FB data to test this with. Thanks for the report and details though, it makes this a lot easier to approach.

BryceStevenWilley avatar Jan 22 '20 04:01 BryceStevenWilley

Might be able to take a look at this. Going to try it

htkcodes avatar Jan 22 '20 04:01 htkcodes

Did a little reading and found this method of encoding / decoding utf8 in js.

I'm not really sure what I'm doing in JS, but I tried adding the decode code in a few spots in the math.js Facebook import function, and got the emojis showing up by changing:

'BODY': msg.content to 'BODY': decodeURIComponent(escape(msg.content))

This seems to work, as the emojis now appear to register (I get the emoji map and they appear in the word use difference part).

That said I do get numbers and common symbols showing up in the word use difference, but they're things like 6, 10, *, &, 6:30, etc. Are any of these meant to be filtered out of this? If not then I think this change gets it working
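On the `decodeURIComponent(escape(msg.content))` change above: for characters below U+0100, `escape` percent-encodes each character as a single byte and `decodeURIComponent` then reads those bytes as UTF-8, so it is the same two-step conversion as the Python fix. The sketch below mirrors that mechanism with Python's `urllib.parse`; the function name `decode_like_js` is hypothetical, and the sketch does not reproduce the `%uXXXX` form that `escape` emits for characters above U+00FF.

```python
from urllib.parse import quote, unquote


def decode_like_js(text):
    # Step 1 (like escape): percent-encode every character as one Latin-1 byte.
    percent_encoded = quote(text, safe="", encoding="latin1")
    # Step 2 (like decodeURIComponent): read the percent-encoded bytes as UTF-8.
    return unquote(percent_encoded, encoding="utf-8", errors="strict")


sample = "\u00f0\u009f\u0098\u0098"  # one emoji stored as four separate "characters"
percent_form = quote(sample, safe="", encoding="latin1")  # "%F0%9F%98%98"
result = decode_like_js(sample)
```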

Gusman10000 avatar Jan 23 '20 04:01 Gusman10000

Hey Gusman, I added your fix in https://github.com/BryceStevenWilley/visioning_texts/commit/167962724fe92d24c89ddac8b28eb0048ee96fab, thanks for the help! I'll double check that it works with my FB info, and close this issue when it does.

And at the moment, yeah, common numbers and symbols aren't filtered from the word difference. That's being tracked in #10.

BryceStevenWilley avatar Jan 23 '20 04:01 BryceStevenWilley

Works for me, I've got some emojis in the emoji count!

BryceStevenWilley avatar Jan 23 '20 05:01 BryceStevenWilley

Hi guys, this method doesn't work for all emojis. For example:

'\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00a7'

That's why I didn't bring it up earlier.

htkcodes avatar Jan 23 '20 18:01 htkcodes

'\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00a7'

Took me too long to figure out that this is 😘😘😍. Sorry for closing too soon @htkcodes.
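A rough sketch of why this string is harder: the Latin-1 to UTF-8 round trip succeeds, but the resulting code points (U+FE32C and U+FE327) fall in a Unicode private-use plane, so they do not render as emoji by themselves. One possible workaround, assuming nothing about how the project ultimately handled it, is a lookup table. The table below is hypothetical and contains only the two code points from this thread, paired with the emoji identified above.

```python
# Hypothetical lookup table: only the two private-use code points seen in
# this thread, mapped to the standard emoji they were identified as.
PRIVATE_USE_TO_EMOJI = {
    0xFE32C: "\U0001F618",  # face blowing a kiss
    0xFE327: "\U0001F60D",  # smiling face with heart-eyes
}


def fixup_string(text):
    # Same byte-level repair as earlier in the thread.
    return text.encode("latin1").decode("utf8")


def remap_private_use(text, table=PRIVATE_USE_TO_EMOJI):
    # str.translate swaps each listed code point for its replacement string.
    return text.translate(table)


raw = "\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00ac\u00f3\u00be\u008c\u00a7"
decoded = fixup_string(raw)
code_points = [hex(ord(ch)) for ch in decoded]
fixed = remap_private_use(decoded)
```

Here `code_points` shows the private-use values after decoding, and `fixed` holds the remapped string. A complete table would need a reliable source for the other private-use emoji code points, which this thread does not provide.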

BryceStevenWilley avatar Jan 23 '20 18:01 BryceStevenWilley