Splitting
Emoji/Unicode Range Support
Splitting does not appear to work with certain Unicode ranges. ⚡️ works, but some other emoji do not. Maybe this is an issue with "".split().
Yes; I assume there are some Unicode issues. It may be best to split on /\S/. Not sure if that would help.
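A minimal sketch of what is probably going on, assuming the failures come from characters outside the Basic Multilingual Plane: "".split("") cuts between UTF-16 code units, so an emoji stored as a surrogate pair is torn into two invalid halves. ⚡ fits in a single code unit, which may be why it appears to work. The variable names are only for illustration.

```javascript
// U+1F600 (grinning face) lies outside the Basic Multilingual Plane, so
// JavaScript stores it as two UTF-16 code units (a surrogate pair).
const grin = "\u{1F600}";
// The lightning bolt is U+26A1 followed by the variation selector U+FE0F.
const bolt = "\u26A1\uFE0F";

// split("") cuts between code units and breaks the surrogate pair.
const byUnit = grin.split("");
// Array.from iterates by code point and keeps the pair intact.
const byPoint = Array.from(grin);

console.log(byUnit.length, byPoint.length); // 2 1

// U+26A1 is a single code unit, so split("") does not corrupt it, but the
// variation selector still lands in an element of its own.
console.log(bolt.split("").length); // 2
```

Iterating by code point fixes the surrogate-pair case only. Variation selectors, skin-tone modifiers, and zero-width-joiner sequences still come out as separate elements.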
Created an emoji-support branch to work through this.
From our conversation:
shshaw [10:30 AM] Lodash does seem to work. https://codepen.io/shshaw/pen/451e393401663892e0fee944575d4bd2 In total, lodash is ~4kb gzipped… I wonder with treeshaking how small just toArray could be
chars = _.toArray(wholeText)
notoriousb1t [10:36 AM] it probably wouldn't add a whole lot
shshaw [10:37 AM] We could potentially simplify the logic overall with it
shshaw [10:37 AM] There’s a lodash-es for an ES6 version of lodash. That may help with treeshaking. It may be this simple: https://www.neontsunami.com/posts/allow-treeshaking-with-lodash
This seems to be the source for lodash's emoji-processing toArray, for reference:
https://github.com/lodash/lodash/blob/4ea8c2ec249be046a0f4ae32539d652194caf74f/.internal/unicodeToArray.js
In theory we could probably simplify from that, but ideally we could just import that and/or the stringToArray function with treeshaking and not have to maintain the unicode/RegEx: https://github.com/lodash/lodash/blob/4ea8c2ec249be046a0f4ae32539d652194caf74f/.internal/stringToArray.js
Unfortunately, if you want to capture the full nuance of emoji sequences, you end up needing to do something at least as complex as lodash's unicodeToArray.
You can go with some simpler options if you're okay with some broken edge cases.
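One possible "simpler option" could look like the sketch below. It is not lodash's approach and not what Splitting ships. The helper name splitChars and the choice of which characters to glue together are assumptions for illustration. It iterates by code point, then attaches variation selectors, skin-tone modifiers, and zero-width-joiner sequences to the preceding character.

```javascript
// Hypothetical helper, not part of Splitting or lodash.
const ZWJ = "\u200D"; // zero-width joiner

// Code points that modify the character before them: variation selectors
// U+FE0E / U+FE0F and the skin-tone modifiers U+1F3FB through U+1F3FF.
const ATTACHES_TO_PREVIOUS = /^[\uFE0E\uFE0F\u{1F3FB}-\u{1F3FF}]$/u;

function splitChars(text) {
  const out = [];
  // Array.from iterates by code point, so surrogate pairs stay whole.
  for (const cp of Array.from(text)) {
    const last = out.length - 1;
    const joinsPrevious =
      last >= 0 &&
      (cp === ZWJ ||                    // joiner glues onto the previous char
        ATTACHES_TO_PREVIOUS.test(cp) || // modifier glues onto the previous char
        out[last].endsWith(ZWJ));        // char after a joiner continues the sequence
    if (joinsPrevious) {
      out[last] += cp;
    } else {
      out.push(cp);
    }
  }
  return out;
}

// Thumbs up with a skin-tone modifier stays together.
console.log(splitChars("a\u{1F44D}\u{1F3FD}b").length); // 3
// A man + ZWJ + woman + ZWJ + girl family sequence stays together.
console.log(splitChars("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}").length); // 1
// Known broken edge case: a flag (two regional indicators) is split in two.
console.log(splitChars("\u{1F1FA}\u{1F1F8}").length); // 2
```

The broken edge cases here include flags, keycap sequences, and combining marks such as accents, which is roughly the ground the larger lodash regex covers.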
I think the first goal is to improve support for it, not necessarily to support all nuances of it.
Yes. Goal wouldn't necessarily be complete support of all permutations, but widest support at the smallest file size. Looks like Lodash's regex could compress down to about 628 bytes (223 gzipped), so that's the goal to beat.
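To compare a candidate against that target, a rough way to measure raw and gzipped size in Node is sketched below. The measure helper and the candidate string are hypothetical stand-ins. Gzipping a short snippet in isolation includes fixed header overhead, so the number is only an approximation of what the snippet adds to a real bundle.

```javascript
const zlib = require("zlib");

// Hypothetical helper: raw and gzipped byte sizes of a source snippet.
function measure(source) {
  const raw = Buffer.byteLength(source, "utf8");
  const gzipped = zlib.gzipSync(source, { level: 9 }).length;
  return { raw, gzipped };
}

// Stand-in candidate: the source text of a small character-class regex.
const candidate = "/[\\uFE0E\\uFE0F\\u{1F3FB}-\\u{1F3FF}]/u";
console.log(measure(candidate));
```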
:shipit:
https://emojipedia.org/zero-width-joiner/ For further research
https://emojipedia.org/emoji-zwj-sequences/
I think this could be an interesting point about char splitting for unicodes/emojis: https://stackoverflow.com/a/38901550/7355534
Great reference! Thank you.
Reference: https://thekevinscott.com/emojis-in-javascript/
Reference: https://github.com/davatron5000/Lettering.js/blob/a4c6b18c28ecc50675937b10e88328473dbb15ce/jquery.lettering.js#L34