twemoji use fully qualified / unified names in svg and png paths

This PR attempts to fix the issues mentioned in #405 and #419 by using Unicode fully qualified / unified names in the paths for SVGs and PNGs.

To get the fully qualified name for each emoji, I used the emoji.json provided by https://github.com/iamcal/emoji-data and the following Node script.

const fs = require('fs');
let emojiList = JSON.parse(fs.readFileSync("./emoji.json"))

// Flatten skin variations and append them to emojiList as separate emojis.
emojiList.filter(e => e.skin_variations)
  .forEach(e => emojiList = emojiList.concat(Object.values(e.skin_variations)))

function unifiedToNative(unified) {
  const codePoints = unified.split('-').map(u => `0x${u}`);
  return String.fromCodePoint.apply(String, codePoints);
}

// Convert each unified code point sequence to its native string representation.
emojiList.forEach(e => e.native = unifiedToNative(e.unified))

// Parse each native representation into a twemoji entity.
const { parse } = require('twemoji-parser');
emojiList.forEach(e => e.entity = parse(e.native)[0])

function getTwemojiUnicode(url) {
  return url.match(/([^\/]+)(?=\.\w+$)/)[0]
}

// Get the twemoji unicode representation from entity url.
emojiList.forEach(e => e.twemojiUnicode = getTwemojiUnicode(e.entity.url))

// Calculate the list of emojis where the twemoji name and the unified name differ.
let diff = emojiList.filter(e => e.twemojiUnicode !== e.unified.toLowerCase())
  .filter(d => d.twemojiUnicode !== "1f441") // BUG: see https://github.com/twitter/twemoji/issues/419

diff.forEach(e => {
  const unified = e.unified.toLowerCase();
  fs.renameSync(`./assets/72x72/${e.twemojiUnicode}.png`, `./assets/72x72/${unified}.png`);
  fs.renameSync(`./assets/svg/${e.twemojiUnicode}.svg`, `./assets/svg/${unified}.svg`);
})

// To-do: manually handle 1f441.

The only exception to this was the eye emoji mentioned in #405, because both 👁️ and 👁️‍🗨️ resolve to "1f441" with the twemoji-parser. For the eye emoji, I had to manually rename two files.
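
A collision like this one can be spotted before any file is renamed. The following is a minimal sketch, not part of the script above: `findCollisions` is a hypothetical helper, and the sample entries are illustrative stand-ins for the objects the script builds (only the `unified` and `twemojiUnicode` fields are used).

```javascript
// Group entries by the file name the parser resolves them to and report
// every name that is claimed by more than one unified sequence.
// `findCollisions` and the sample data are hypothetical, for illustration only.
function findCollisions(entries) {
  const byName = new Map();
  for (const { unified, twemojiUnicode } of entries) {
    const bucket = byName.get(twemojiUnicode) || [];
    bucket.push(unified.toLowerCase());
    byName.set(twemojiUnicode, bucket);
  }
  // Keep only names shared by two or more unified sequences.
  return [...byName].filter(([, unifieds]) => unifieds.length > 1);
}

// Sample entries mirroring the eye emoji case described above.
const sample = [
  { unified: '1F441-FE0F', twemojiUnicode: '1f441' },
  { unified: '1F441-FE0F-200D-1F5E8-FE0F', twemojiUnicode: '1f441' },
  { unified: '1F600', twemojiUnicode: '1f600' },
];

console.log(findCollisions(sample));
```

Running a check like this against the full list before the rename loop would flag any name that needs manual handling instead of silently overwriting a file.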

Jul 05 '20 05:07 BrianHung

All committers have signed the CLA.

Jul 05 '20 05:07 CLAassistant

Thanks for giving it a shot! Twemoji and twemoji-parser are intended to be interoperable as part of how we use them at Twitter, so we're working internally on a more complete solution to #405, hopefully by the end of the year. Since this change breaks interoperability and would cause a pretty substantial divergence between our internal and open-sourced versions of this package, I'm leaving this PR open for now.

Oct 13 '20 21:10 jdecked

In my opinion, it should be the other way around: instead of using a fully qualified sequence, remove all modifiers and variation selectors. For instance, U+FE0F (VS-16) exists to indicate that a character should be rendered as a colourful image rather than monochrome text. Since those files are already images, it's not needed. The same goes for U+200D (ZWJ), which is used to join several characters into one. Each sequence is already a single file, so the joiner isn't meaningful.

In addition to being shorter, it's more robust against possible changes in future Unicode versions if some sequences are retooled to make some of those characters optional.
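
As a rough sketch of what that naming scheme could look like, assuming hyphen-separated hexadecimal code point names like the ones used in this PR (`stripJoiners` is a hypothetical helper, not part of twemoji or twemoji-parser):

```javascript
// Drop U+FE0F (VS-16) and U+200D (ZWJ) from a hyphen-separated code point name.
// `stripJoiners` is a hypothetical helper, not part of twemoji or twemoji-parser.
const IGNORED = new Set(['fe0f', '200d']);

function stripJoiners(unified) {
  return unified
    .toLowerCase()
    .split('-')
    .filter(cp => !IGNORED.has(cp))
    .join('-');
}

console.log(stripJoiners('1F441-FE0F'));                  // 1f441
console.log(stripJoiners('1F441-FE0F-200D-1F5E8-FE0F'));  // 1f441-1f5e8
console.log(stripJoiners('1F468-200D-1F469-200D-1F466')); // 1f468-1f469-1f466
```

With these inputs the two eye emojis end up with distinct names. Whether the stripped names stay unique across the full emoji set is not checked here.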

Nov 15 '20 01:11 JoshyPHP