remove-accents experiment using unicode decomposition & regex char ranges

DRAFT: this is not really intended for merging but instead as a discussion point regarding how we might be able to identify accents without manually enumerating them all.

The tl;dr is:

string
    .normalize('NFKD')
    .replace(COMBINING_MARKS, '')
    .normalize('NFKC')

There is a description of the method in https://github.com/tyxla/remove-accents/pull/44#issuecomment-1640742011 and some more related discussion in https://github.com/tyxla/remove-accents/issues/12#issuecomment-1609367144, the ranges have been lifted from another project I worked on.

I'd like to open up a chat about this method, I think it's quite interesting, all the tests pass except for the one which enumerates a long list of characters.

Interestingly characters like Ø don't decompose with this method, what does that mean exactly? does it mean that converting it to a latin capital O is correct or incorrect?

# remove accents from string
not ok 1 should be equivalent
  ---
    operator: deepEqual
    expected: |-
      'AAAAAAAAAEAACCEEEEEEEEIIIIIDNOOOOOOOOOUUUUYaaaaaaaaaeaacceeeeeeeeiiiiinooooooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgGgHhHhIiIiIiIiIiIJijJjKkKkLlLlLlLlllMmNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuUuUuAaAEaeOodTHthPpSsXxГгКкAaEeIiNnOoOoUuWwYyAaEeIiOoRrUuAAAEaaaeCcHhKkMmNnPpRrTtVvXxYyAEIOaeioRrUustSTBbFfGgHhJjKkMmPpQqSsVvWwXxYyAaBbDdEeEeHhIiIiMmOoQqUuXxZzss'
    actual: |-
      'AAAAAAAAÆAACCEEEEEEEEIIIIIÐNOOOOOØOOOUUUUYaaaaaaaaæaacceeeeeeeeiiiiinoooooøooouuuuyyAaAaAaCcCcCcCcDdĐđEeEeEeEeEeGgGgGgGgGgHhĦħIiIiIiIiIıIJijJjKkKkLlLlLlL·l·ŁłMmNnNnNnʼnOoOoOoŒœRrRrRrSsSsSsSsTtTtŦŧUuUuUuUuUuUuWwWwYyYZzZzZzsƒOoUuAaIiOoUuUuUuUuUuUuUuAaÆæØøðÞþPpSsXxГгКкAaEeIiNnOoOoUuWwYyAaEeIiOoRrUuAAAEaaaeCcHhKkMmNnPpRrTtVvXxYyAEIOaeioRrUustSTBbFfGgHhJjKkMmPpQqSsVvWwXxYyAaBbDdEeƐɛHhIiƗɨMmOoQqUuXxZzß'

edit: sorry about the formatting, my editor seems to have automatically applied the Standard JS style, I can revert those change if we decide to proceed with it.

Jul 18 '23 18:07 missinglink

Note that the regenerate dependency can be removed in favour of the pattern it generates, namely:

[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]

Jul 18 '23 19:07 missinglink

Thanks for the PR, @missinglink 🙌

Interestingly characters like Ø don't decompose with this method, what does that mean exactly? does it mean that converting it to a latin capital O is correct or incorrect?

Well, we intentionally added replacement for these characters and it's a good example of why such a library is preferred to using String.normalize().

That being said, I'd welcome a simplification of the current approach that supports all current characters that we replace.

Jul 20 '23 11:07 tyxla

remove-accents remove-accents copied to clipboard

experiment using unicode decomposition & regex char ranges

remove-accents
remove-accents copied to clipboard