
None of the regexes match emoji, and only emoji

Open robintown opened this issue 1 year ago • 5 comments

A regex that matches emoji would be a really useful thing to have in the JS ecosystem! Unfortunately, between Emojibase and emoji-regex, I still haven't seen a package that actually does this. In the case of Emojibase:

  • emojibase-regex matches some textual characters such as '↔'.
  • emojibase-regex/emoji doesn't match emoji without U+FE0F, such as '✨'.
  • emojibase-regex/emoji-loose matches some textual characters without U+FE0E, such as '↔'.
  • And the rest of the provided regexes are obviously not intended to be used for matching emoji.

What's missing is a regex that matches exactly those character sequences that are presented to users as emoji. Some characters are defined in Unicode to default to emoji presentation (see the Emoji_Presentation section), while others require U+FE0F to change their presentation mode. A correct implementation would account for both of these facts, and use a negative lookahead to avoid matching characters with U+FE0E.
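A minimal sketch of that rule for single code points, using only `u`-mode Unicode property escapes. The names `singleEmoji` and `findSingleEmoji` are hypothetical and not part of Emojibase. The pattern deliberately ignores flags, keycaps, skin-tone modifiers, and ZWJ sequences, so it illustrates the presentation logic rather than serving as a complete emoji regex.

```javascript
// Sketch only: handles single code points, not flags, keycaps, or ZWJ sequences.
// Branch 1: characters that default to emoji presentation, unless the text
//           selector U+FE0E follows; a trailing U+FE0F is consumed if present.
// Branch 2: any emoji-capable character explicitly followed by U+FE0F.
const singleEmoji = /\p{Emoji_Presentation}(?!\uFE0E)\uFE0F?|\p{Emoji}\uFE0F/gu;

// Hypothetical helper: returns every substring presented as an emoji.
function findSingleEmoji(text) {
  return text.match(singleEmoji) || [];
}

// '✨' defaults to emoji presentation; '↔' needs U+FE0F to become an emoji.
console.log(findSingleEmoji("a \u2728 b \u2194 c \u2194\uFE0F d \u2728\uFE0E"));
```

In this input only the bare '✨' and the '↔' followed by U+FE0F are returned; the bare '↔' and the '✨' followed by U+FE0E are skipped.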

robintown avatar Jun 05 '24 22:06 robintown

I'll be honest, it's been so long since I've worked on this emoji stuff that I've forgotten a lot of how it works. I always have to re-learn the codebase each time I update it. So I'm sure there are bugs everywhere.

With that said, I am tinkering with the regexes here: https://github.com/milesj/emojibase/pull/175

milesj avatar Jun 07 '24 23:06 milesj

So after looking at this post and the code again, the behavior you describe is accurate. It's by design.

  • emojibase-regex matches some textual characters such as '↔'.
  • emojibase-regex/emoji doesn't match emoji without U+FE0F, such as '✨'.
  • emojibase-regex/emoji-loose matches some textual characters without U+FE0E, such as '↔'.
  • And the rest of the provided regexes are obviously not intended to be used for matching emoji.

I also use regexgen (https://github.com/devongovett/regexgen) to generate the regex pattern, and it does not support negative lookaheads. I'm not aware of another library to handle this and I'm definitely not going to write it from scratch.

There is a regex using unicode properties, but I haven't tested it in years: https://emojibase.dev/docs/regex#unicode-property-support

milesj avatar Jun 09 '24 00:06 milesj

Been thinking about this more, and I think we could solve this by using functions, like isEmojiPresentation and isTextPresentation, instead of relying purely on RegExp instances. With functions we could run the necessary checks to ensure it's exactly what you want.

milesj avatar Jun 09 '24 18:06 milesj

Re: the Unicode properties approach, I was happy to discover that the new RegExp v mode makes writing an emoji regex by hand pretty easy, and this is what I've ended up going for.

/\p{RGI_Emoji}(?!\uFE0E)(?:(?<!\uFE0F)\uFE0F)?/v

All major browsers support it, though only as of late 2023. You can get a version that kinda sorta works while only using u mode if you replace \p{RGI_Emoji} with this regex, but it's not going to do well with flags and ZWJ sequences unless you teach it exactly what the valid sequences are.
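A small usage sketch of the `v`-mode regex above, with feature detection since the flag is relatively new. The helper names `buildEmojiRegex` and `findEmoji` are hypothetical, and returning `null` on unsupported engines is just one possible fallback policy.

```javascript
// Build the regex via the constructor so that engines without the `v` flag
// throw at runtime (catchable) instead of failing to parse the whole file.
function buildEmojiRegex() {
  try {
    return new RegExp(
      String.raw`\p{RGI_Emoji}(?!\uFE0E)(?:(?<!\uFE0F)\uFE0F)?`,
      "gv"
    );
  } catch {
    return null; // `v` flag (unicodeSets mode) not supported here
  }
}

const emojiRegex = buildEmojiRegex();

// Returns an array of emoji found in `text`, or null if `v` mode is unavailable.
function findEmoji(text) {
  if (emojiRegex === null) return null;
  return text.match(emojiRegex) || [];
}

// A flag is a two-code-point sequence; RGI_Emoji matches it as one unit.
console.log(findEmoji("\u{1F1EF}\u{1F1F5} \u2728 \u2194 \u2194\uFE0F"));
```

Where `v` mode is available, the flag sequence comes back as a single match, and the bare '↔' is skipped while '↔' followed by U+FE0F is kept.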

robintown avatar Jun 12 '24 14:06 robintown

Nice, good to know! Been waiting years for all those to become available.

milesj avatar Jun 12 '24 16:06 milesj