JoliTypo
JoliTypo copied to clipboard
Some maintenance work
- Allow org_heigl/hyphenator in ^3.0
- Simplify composer.json 'scripts' section
- Drop support for Symfony < 4.4 + Add support for ^6.0
- Run tests on PHP 8.1
there is a failure on PHP 8.1, but I don't know how to fix it:
1) JoliTypo\Tests\EnglishTest::testFixFullText
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
<h3>Pronun­ci­ation</h3>\n
\n
<p>A humor­ous image announ­cing the launch of a White House Tumblr suggests pronoun­cing GIF with a hard “G”.</p>\n
-<p>The creat­ors of the format pronounced GIF as “Jif” with a soft “G” /ˈdʒɪf/ as in “gin”.</p>\n
-<p>An altern­at­ive pronun­ci­ation with a hard “G” /ˈɡɪf/ as in “graph­ics”, reflect­ing the expan­ded acronym, is in wide­spread usage.</p>\n
+<p>The creat­ors of the format pronounced GIF as “Jif” with a soft “G” /ˈdʒɪf/ as in “gin”.</p>\n
+<p>An altern­at­ive pronun­ci­ation with a hard “G” /ˈɡɪf/ as in “graph­ics”, reflect­ing the expan­ded acronym, is in wide­spread usage.</p>\n
<p>Both pronun­ci­ations are acknow­ledged by the […] Merriam-Webster’s Collegi­ate Diction­ary.</p>\n
\n
<p>We also have “<span>HTML in quote</span>” to fix…</p>'
The line causing the bug is this one : https://github.com/jolicode/JoliTypo/blob/master/src/JoliTypo/Fixer.php#L311
For an uknown reason (from me at least), the behaviour of this function changed in PHP 8.1. I tried running this code in both PHP 8.0 and PHP 8.1
$tofix = "/ˈdʒɪf/";
echo(mb_detect_encoding($tofix) . PHP_EOL);
mb_detect_order('ASCII,UTF-8,ISO-8859-1,windows-1252,iso-8859-15');
echo(mb_detect_encoding($tofix));
In PHP 8.0 I get
UFT-8
UTF-8
In PHP 8.1 I get
UTF-8
Windows-1252
Sandbox showing the issue : https://onlinephp.io/c/5f836
So after doing some more research I found out that the function that has changed is not mb_encoding_order
but mb_detect_encoding
. In this bug report we can see the following answer from Alex Dowad :
Question :
I'm guessing that in earlier versions there was no such thing as demerits (or the algo was even more different), so the provided priority list had more impact on the result. Is that right?
Answer :
Yep. The earlier versions of mb_detect_encoding only checked which encodings the input string was valid in, and picked the first one on the list. If the string was valid in more than one encoding, it did not do anything at all to try to figure out which one was most likely.
So now mb_detect_encoding
just returns the most likely encoding and doesn't care about the order they were given. This is confirmed by this comment : https://github.com/php/php-src/issues/8279#issuecomment-1083613035
Now it also applies heuristics to detect which of the valid text encodings in the specified list (if there are more than one) is most likely to be correct.
This is not what we want to do in this library. We do not want to get the most likely encoding, we want to validate that the text we are given is valid in some encodings, with some preferences. Therefore I suggest we test the encodings one by one, probably in a loop like suggested in the comment above. This would restore our previous behaviour.
I tried fixing it :
- fix a test that was not working anymore (now it matches the other similar tests).
- remove all the now useless
mb_detect_order
. - found a workaround for
mb_detect_encoding
to detect UTF-8 in priority but it is now returning UTF-8 all the time. This is what we had before and it seems to be ok. - removed encoding to HTML entites. It is deprecated to do it with
mb_convert_encoding
in PHP 8.2 (https://github.com/php/php-src/commit/9308974f8cc6c1046f228be5320fe067913ba987). We actually don't want to decode thinks like<3
to<3
because then libxml sees an open HTML tag. It was working before because we took advantage of the weird behaviour ofmb_detect_encoding
with some HTML entites (more info on this in the commit message above).
Awesome work!
I'll take care of the deprecation if you want. Or if you want, you must add: https://github.com/symfony/recipes/blob/main/symfony/framework-bundle/5.3/config/packages/framework.yaml#L5
Thank you both :+1: