grapheme_strlen shows different length of emoji ZWJ Sequence when compared to native

Open Luc45 opened this issue 4 years ago • 6 comments

Take the following emoji for instance: 👩‍👩‍👦‍👦

This is a sequence of four emoji glued together by Zero Width Joiner (ZWJ) characters, as seen on https://emojipedia.org/family-woman-woman-boy-boy/.
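
To make the structure concrete, here is a minimal sketch (not part of the original report) that lists the code points in the sequence. The helper name `utf8_code_points` is made up for illustration; it relies only on `preg_split` with the `/u` modifier and `unpack`, so it should not need the intl or mbstring extensions.

```php
<?php
// Hypothetical helper (not part of the polyfill): list the code points of a UTF-8 string.
function utf8_code_points(string $s): array
{
    $points = [];
    // With the /u modifier, an empty pattern splits between code points, not bytes.
    foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $bytes = array_values(unpack('C*', $char));
        $n = count($bytes);
        // Lead byte: keep only the payload bits (all of them for ASCII, else 5, 4 or 3).
        $cp = ($n === 1) ? $bytes[0] : ($bytes[0] & (0xFF >> ($n + 1)));
        // Each continuation byte contributes its low 6 bits.
        for ($i = 1; $i < $n; ++$i) {
            $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
        }
        $points[] = sprintf('U+%04X', $cp);
    }

    return $points;
}

// The family emoji, written with escapes so the joiners are visible in source.
$family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}";

echo implode(' ', utf8_code_points($family)), "\n";
// U+1F469 U+200D U+1F469 U+200D U+1F466 U+200D U+1F466
```

That is seven code points: four emoji and three U+200D joiners, which a grapheme-aware function is expected to report as a single cluster.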

When checking the length with the native grapheme_strlen(), it returns 1, while this library returns 4.

This is possibly due to a bug in the GRAPHEME_CLUSTER_RX regex.

This bug should only happen on PCRE_VERSION < 8.32. However, when combined with bug #369, it applies to every PCRE_VERSION value that contains a date, which seems to be the default format.

Therefore, the grapheme_strlen function in this polyfill is likely to provide incorrect results, such as in this example:

Expected result grapheme_strlen('👩‍👩‍👦‍👦'):

The test is being conducted using the regex: \X

int(1)
int(1)
int(1)
int(1)

Actual result with the custom cluster grapheme_strlen('👩‍👩‍👦‍👦'):

The test is being conducted using the regex: (?:\r\n|(?:[ -~\x{200C}\x{200D}]|[ᆨ-ᇹ]+|[ᄀ-ᅟ]*(?:[가개갸걔거게겨계고과괘괴교구궈궤귀규그긔기까깨꺄꺠꺼께껴꼐꼬꽈꽤꾀꾜꾸꿔꿰뀌뀨끄끠끼나내냐냬너네녀녜노놔놰뇌뇨누눠눼뉘뉴느늬니다대댜댸더데뎌뎨도돠돼되됴두둬뒈뒤듀드듸디따때땨떄떠떼뗘뗴또똬뙈뙤뚀뚜뚸뛔뛰뜌뜨띄띠라래랴럐러레려례로롸뢔뢰료루뤄뤠뤼류르릐리마매먀먜머메며몌모뫄뫠뫼묘무뭐뭬뮈뮤므믜미바배뱌뱨버베벼볘보봐봬뵈뵤부붜붸뷔뷰브븨비빠빼뺘뺴뻐뻬뼈뼤뽀뽜뽸뾔뾰뿌뿨쀄쀠쀼쁘쁴삐사새샤섀서세셔셰소솨쇄쇠쇼수숴쉐쉬슈스싀시싸쌔쌰썌써쎄쎠쎼쏘쏴쐐쐬쑈쑤쒀쒜쒸쓔쓰씌씨아애야얘어에여예오와왜외요우워웨위유으의이자재쟈쟤저제져졔조좌좨죄죠주줘줴쥐쥬즈즤지짜째쨔쨰쩌쩨쪄쪠쪼쫘쫴쬐쬬쭈쭤쮀쮜쮸쯔쯰찌차채챠챼처체쳐쳬초촤쵀최쵸추춰췌취츄츠츼치카캐캬컈커케켜켸코콰쾌쾨쿄쿠쿼퀘퀴큐크킈키타태탸턔터테텨톄토톼퇘퇴툐투퉈퉤튀튜트틔티파패퍄퍠퍼페펴폐포퐈퐤푀표푸풔풰퓌퓨프픠피하해햐햬허헤혀혜호화홰회효후훠훼휘휴흐희히]?[ᅠ-ᆢ]+|[가-힣])[ᆨ-ᇹ]*|[ᄀ-ᅟ]+|[^\p{Cc}\p{Cf}\p{Zl}\p{Zp}])[\p{Mn}\p{Me}\x{09BE}\x{09D7}\x{0B3E}\x{0B57}\x{0BBE}\x{0BD7}\x{0CC2}\x{0CD5}\x{0CD6}\x{0D3E}\x{0D57}\x{0DCF}\x{0DDF}\x{200C}\x{200D}\x{1D165}\x{1D16E}-\x{1D172}]*|[\p{Cc}\p{Cf}\p{Zl}\p{Zp}])

int(1)
int(4)
int(1)
int(4)

Luc45, Sep 10 '21 20:09

I forgot to share the code snippet used on the results above: https://3v4l.org/OPBFq#v8.0.10
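
The linked snippet is not reproduced here. As a rough sketch of this kind of comparison (my own reconstruction, not the code behind the link), the following counts `\X` matches with `preg_match_all` and, when the intl extension is loaded, prints the native `grapheme_strlen()` result next to it. The variable names are illustrative only.

```php
<?php
// The family emoji from the report, written with escapes.
$family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}";

// \X matches one extended grapheme cluster, as the linked PCRE library defines it,
// so the number of matches is that library's idea of the grapheme length.
$pcreCount = preg_match_all('/\X/u', $family);

echo 'PCRE_VERSION: ', PCRE_VERSION, "\n";
echo '\X matches: ', var_export($pcreCount, true), "\n";

// Check for the extension rather than the function name, because the polyfill
// defines grapheme_strlen() itself when intl is missing.
if (extension_loaded('intl')) {
    echo 'native grapheme_strlen: ', grapheme_strlen($family), "\n";
}
```

No expected output is listed because the `\X` count depends on the PCRE version PHP was built against, which is the point of this issue.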

Luc45, Sep 13 '21 13:09

Would you agree with considering that once #369 is merged, this issue can be closed? Aka we don't provide the most recent regexp to ppl that use older PCRE versions?

Alternatively, would you mind looking at improving this regexp? I'm sure I generated it but I don't remember how. There might be a script somewhere in this repo or maybe in https://github.com/tchwork/utf8

nicolas-grekas, Sep 13 '21 13:09

Thanks for asking for my input.

This package requires PHP 7.1, which seems to use PCRE 8.38 according to 3v4l.org: https://3v4l.org/S1bPl

On the PHP versions made available by 3v4l, 8.32 is used on PHP versions below 5.5.9, but I'm not sure if this will always be the case.

Is it possible for PHP 7.1+ to be running PCRE 8.32..?

Luc45, Sep 13 '21 14:09

It seems PCRE 8.32 made its way into PHP core in 2013: https://github.com/php/php-src/commit/357ab3cbada57374075ccf57c9ec25bbbbcb6948

It was replaced with 8.35 in 2014: https://github.com/php/php-src/commit/dd0e96cca360c5584ec80319ae99fc07c0f2c5f3

I guess it's fine to drop support for the old PCRE_VERSION. It would be ideal if this could be enforced in composer.json through ext-pcre, but given the non-standard version number of PCRE, it can be challenging to enforce the versions.

https://jubianchi.github.io/semver-check/#/^10%20||%20^8.34/8.34%202013-12-15

Or "ext-pcre": "> 8.32":

https://jubianchi.github.io/semver-check/#/%3E%208.32/8.34%202013-12-15
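
If the constraint can't be expressed reliably in composer.json, a runtime check is one alternative. Below is a minimal sketch, assuming the `PCRE_VERSION` string always starts with a dotted number followed by a date, as in the `8.34 2013-12-15` sample from the links above. The helper name `pcre_numeric_version` is hypothetical and not part of the polyfill.

```php
<?php
// Hypothetical helper: keep only the leading dotted-number part of a PCRE version string.
function pcre_numeric_version(string $raw): string
{
    // Fall back to '0' when the string does not start with a number (an assumption of this sketch).
    return preg_match('/^\d+(?:\.\d+)*/', $raw, $m) ? $m[0] : '0';
}

// The sample string comes from the semver-check links above.
var_dump(pcre_numeric_version('8.34 2013-12-15'));                               // string(4) "8.34"
var_dump(version_compare(pcre_numeric_version('8.34 2013-12-15'), '8.32', '>')); // bool(true)

// The same check against whatever PCRE this PHP build reports.
var_dump(version_compare(pcre_numeric_version(PCRE_VERSION), '8.32', '>'));
```

Stripping the date first keeps `version_compare` working on a plain `major.minor` string, so the result does not depend on how the trailing date would be parsed.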

Luc45, Sep 13 '21 14:09

It seems that PHP 7.1.0 requires PCRE > 6.6 to compile.

Only in PHP 7.3 was the version restriction increased to PCRE > 10.30.

These restrictions refer only to compiling PHP with an external PCRE.

Luc45, Sep 13 '21 14:09

Actually, only PCRE2 (10+) is able to handle the initial grapheme_strlen example correctly: https://3v4l.org/grqP9

Luc45, Sep 13 '21 21:09

I'm going to close here because nobody worked on this. Ppl should upgrade to PCRE 10+ (or contribute a fix here ;) )

nicolas-grekas, Jan 30 '23 17:01