
Emoji/Unicode Range Support

Open notoriousb1t opened this issue 6 years ago • 14 comments

Splitting does not appear to work with certain Unicode ranges. ⚡️ works, but some other emojis do not. Maybe this is an issue with "".split()?

notoriousb1t avatar Aug 28 '18 02:08 notoriousb1t
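
For readers following along, here is a minimal sketch of why splitting on the empty string could behave this way. It assumes the failing emojis lie outside the Basic Multilingual Plane, which the thread does not confirm; the variable names are illustrative only.

```javascript
// "⚡" (U+26A1) is inside the Basic Multilingual Plane: one UTF-16 code unit.
const bolt = "\u26A1";
// "😀" (U+1F600) is outside the BMP: stored as a surrogate pair (two code units).
const grin = "\u{1F600}";

// split("") cuts between UTF-16 code units, not between characters.
console.log(bolt.split("").length); // 1
console.log(grin.split("").length); // 2

// Each half is a lone surrogate, which renders as a broken glyph on its own.
const halves = grin.split("");
console.log(halves[0] === "\uD83D", halves[1] === "\uDE00"); // true true
```

If this assumption holds, any emoji stored as a surrogate pair would be cut in half by "".split(""), while single-code-unit symbols such as ⚡ survive.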

Yes; I assume there are some Unicode issues. It may be best to split on /\S/. Not sure if that would help.

shshaw avatar Aug 28 '18 15:08 shshaw

Created an emoji-support branch to work through this.

From our conversation:

shshaw [10:30 AM] Lodash does seem to work: https://codepen.io/shshaw/pen/451e393401663892e0fee944575d4bd2 In total, lodash is ~4kb gzipped… I wonder how small just toArray could be with treeshaking, e.g. chars = _.toArray(wholeText)

notoriousb1t [10:36 AM] it probably wouldn't add a whole lot

shshaw [10:37 AM] We could potentially simplify the logic overall with it

shshaw [10:37 AM] There’s a lodash-es for an ES6 version of lodash. That may help with treeshaking. It may be this simple: https://www.neontsunami.com/posts/allow-treeshaking-with-lodash

shshaw avatar Aug 28 '18 15:08 shshaw

This seems to be the source for lodash's emoji-processing toArray, for reference: https://github.com/lodash/lodash/blob/4ea8c2ec249be046a0f4ae32539d652194caf74f/.internal/unicodeToArray.js

In theory we could probably simplify from that, but ideally we could just import that and/or the stringToArray function with treeshaking and not have to maintain the unicode/RegEx: https://github.com/lodash/lodash/blob/4ea8c2ec249be046a0f4ae32539d652194caf74f/.internal/stringToArray.js

shshaw avatar Aug 28 '18 15:08 shshaw

Unfortunately if you want to capture the full nuance of emoji sequences, you end up needing to do something at least as complex as lodash's unicodeToArray.

You can go with some simpler options if you're okay with some broken edge cases.

jhnsnc avatar Aug 29 '18 18:08 jhnsnc
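
One of the "simpler options" could look like the sketch below: splitting by code point with Array.from. This is an illustration of the trade-off, not something proposed in the thread; splitByCodePoint is a hypothetical helper name.

```javascript
// Hypothetical helper: Array.from iterates a string by code point,
// so surrogate pairs stay together.
function splitByCodePoint(text) {
  return Array.from(text);
}

const grin = "\u{1F600}"; // 😀
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // man + ZWJ + woman + ZWJ + girl
const thumbsUp = "\u{1F44D}\u{1F3FD}"; // thumbs up + skin tone modifier

console.log(splitByCodePoint(grin).length);     // 1: surrogate pair kept intact
console.log(splitByCodePoint(family).length);   // 5: ZWJ sequence broken apart
console.log(splitByCodePoint(thumbsUp).length); // 2: modifier separated from its base
```

The broken edge cases are visible here: single emojis work, but zero-width-joiner sequences and skin tone modifiers are still split into their parts.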

I think the first goal is to improve support for it, not necessarily to support all nuances of it.

notoriousb1t avatar Aug 29 '18 18:08 notoriousb1t

Yes. Goal wouldn't necessarily be complete support of all permutations, but widest support at the smallest file size. Looks like Lodash's regex could compress down to about 628 bytes (223 gzipped), so that's the goal to beat.

shshaw avatar Aug 29 '18 18:08 shshaw

:shipit:

jhnsnc avatar Aug 30 '18 01:08 jhnsnc

https://emojipedia.org/zero-width-joiner/ For further research

shshaw avatar Feb 07 '19 14:02 shshaw

https://emojipedia.org/emoji-zwj-sequences/

shshaw avatar Feb 07 '19 14:02 shshaw
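
As a rough sketch of how zero-width-joiner sequences might be kept together at a small file size, the code below splits by code point and glues joiners, variation selectors, and skin tone modifiers onto the previous character. It is an assumption-laden illustration, not the approach taken on the emoji-support branch; splitEmojiAware and isSkinTone are hypothetical names.

```javascript
const ZWJ = "\u200D";  // zero-width joiner
const VS16 = "\uFE0F"; // variation selector-16 (emoji presentation)

// Hypothetical helper: true for the skin tone modifiers U+1F3FB..U+1F3FF.
function isSkinTone(ch) {
  const cp = ch.codePointAt(0);
  return cp >= 0x1f3fb && cp <= 0x1f3ff;
}

// Hypothetical helper: split by code point, then merge pieces that belong
// to the previous character.
function splitEmojiAware(text) {
  const out = [];
  for (const ch of text) { // for...of iterates by code point
    const prev = out[out.length - 1];
    const glue =
      prev !== undefined &&
      (ch === ZWJ || ch === VS16 || isSkinTone(ch) || prev.endsWith(ZWJ));
    if (glue) {
      out[out.length - 1] = prev + ch;
    } else {
      out.push(ch);
    }
  }
  return out;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(splitEmojiAware("a" + family + "b").length); // 3
console.log(splitEmojiAware("\u{1F44D}\u{1F3FD}").length); // 1
```

Known gaps in this sketch: flag emojis built from regional indicator pairs and combining marks are still split, so it trades completeness for size.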

I think this could be an interesting point about character splitting for Unicode/emoji: https://stackoverflow.com/a/38901550/7355534

bastienrobert avatar Aug 31 '19 21:08 bastienrobert

Great reference! Thank you.

shshaw avatar Sep 02 '19 00:09 shshaw

Reference: https://thekevinscott.com/emojis-in-javascript/

shshaw avatar Oct 31 '19 18:10 shshaw

Reference: https://github.com/davatron5000/Lettering.js/blob/a4c6b18c28ecc50675937b10e88328473dbb15ce/jquery.lettering.js#L34

shshaw avatar Feb 03 '21 16:02 shshaw