polyfill icon indicating copy to clipboard operation
polyfill copied to clipboard

mb_convert_encoding($x, $y, 'HTML-ENTITIES') not functionally equivalent

Open cpeel opened this issue 4 years ago • 1 comments

When calling mb_convert_encoding() with $fromEncoding === 'HTML-ENTITIES', the polyfill does not return functionally equivalent strings to the native function. This is because mb_convert_encoding() uses html_entity_decode() when $fromEncoding === 'HTML-ENTITIES' and that function does not return characters for many numeric entities 0-31 and 127-159. For example:

<?php

require "vendor/symfony/polyfill-mbstring/Mbstring.php";

use Symfony\Polyfill\Mbstring as p;

for($i = 0; $i < 1024; $i++) {
	$string = "&#" . $i . ";";
	$mbstring = mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');
	$polyfill = p\Mbstring::mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');
	if($mbstring != $polyfill) {
		echo "Mismatch: $string - mbstring: $mbstring; polyfill: $polyfill\n";
	}
}

outputs:

Mismatch: &#0; - mbstring: ; polyfill: &#0;
Mismatch: &#1; - mbstring: ; polyfill: &#1;
Mismatch: &#2; - mbstring: ; polyfill: &#2;
Mismatch: &#3; - mbstring: ; polyfill: &#3;
Mismatch: &#4; - mbstring: ; polyfill: &#4;
Mismatch: &#5; - mbstring: ; polyfill: &#5;
Mismatch: &#6; - mbstring: ; polyfill: &#6;
Mismatch: &#7; - mbstring: ; polyfill: &#7;
Mismatch: &#8; - mbstring:; polyfill: &#8;
Mismatch: &#11; - mbstring:
                            ; polyfill: &#11;
Mismatch: &#12; - mbstring:
                            ; polyfill: &#12;
Mismatch: &#14; - mbstring: ; polyfill: &#14;
Mismatch: &#15; - mbstring: ; polyfill: &#15;
Mismatch: &#16; - mbstring: ; polyfill: &#16;
Mismatch: &#17; - mbstring: ; polyfill: &#17;
Mismatch: &#18; - mbstring: ; polyfill: &#18;
Mismatch: &#19; - mbstring: ; polyfill: &#19;
Mismatch: &#20; - mbstring: ; polyfill: &#20;
Mismatch: &#21; - mbstring: ; polyfill: &#21;
Mismatch: &#22; - mbstring: ; polyfill: &#22;
Mismatch: &#23; - mbstring: ; polyfill: &#23;
Mismatch: &#24; - mbstring: ; polyfill: &#24;
Mismatch: &#25; - mbstring: ; polyfill: &#25;
Mismatch: &#26; - mbstring: ; polyfill: &#26;
Mismatch: &#27; - mbstring:  polyfill: &#27;
Mismatch: &#28; - mbstring: ; polyfill: &#28;
Mismatch: &#29; - mbstring: ; polyfill: &#29;
Mismatch: &#30; - mbstring: ; polyfill: &#30;
Mismatch: &#31; - mbstring: ; polyfill: &#31;
Mismatch: &#39; - mbstring: '; polyfill: &#39;
Mismatch: &#127; - mbstring: ; polyfill: &#127;
Mismatch: &#128; - mbstring: €; polyfill: &#128;
Mismatch: &#129; - mbstring: ; polyfill: &#129;
Mismatch: &#130; - mbstring: ‚; polyfill: &#130;
Mismatch: &#131; - mbstring: ƒ; polyfill: &#131;
Mismatch: &#132; - mbstring: „; polyfill: &#132;
Mismatch: &#133; - mbstring: …; polyfill: &#133;
Mismatch: &#134; - mbstring: †; polyfill: &#134;
Mismatch: &#135; - mbstring: ‡; polyfill: &#135;
Mismatch: &#136; - mbstring: ˆ; polyfill: &#136;
Mismatch: &#137; - mbstring: ‰; polyfill: &#137;
Mismatch: &#138; - mbstring: Š; polyfill: &#138;
Mismatch: &#139; - mbstring: ‹; polyfill: &#139;
Mismatch: &#140; - mbstring: Œ; polyfill: &#140;
Mismatch: &#141; - mbstring: ; polyfill: &#141;
Mismatch: &#142; - mbstring: Ž; polyfill: &#142;
Mismatch: &#143; - mbstring: ; polyfill: &#143;
Mismatch: &#144; - mbstring: ; polyfill: &#144;
Mismatch: &#145; - mbstring: ‘; polyfill: &#145;
Mismatch: &#146; - mbstring: ’; polyfill: &#146;
Mismatch: &#147; - mbstring: “; polyfill: &#147;
Mismatch: &#148; - mbstring: ”; polyfill: &#148;
Mismatch: &#149; - mbstring: •; polyfill: &#149;
Mismatch: &#150; - mbstring: –; polyfill: &#150;
Mismatch: &#151; - mbstring: —; polyfill: &#151;
Mismatch: &#152; - mbstring: ˜; polyfill: &#152;
Mismatch: &#153; - mbstring: ™; polyfill: &#153;
Mismatch: &#154; - mbstring: š; polyfill: &#154;
Mismatch: &#155; - mbstring: ›; polyfill: &#155;
Mismatch: &#156; - mbstring: œ; polyfill: &#156;
Mismatch: &#157; - mbstring: ; polyfill: &#157;
Mismatch: &#158; - mbstring: ž; polyfill: &#158;
Mismatch: &#159; - mbstring: Ÿ; polyfill: &#159;

While many of these are control characters (and the native function does return them), the single quote (dec 39) is particularly problematic.

cpeel avatar Mar 18 '21 04:03 cpeel

For the case of single quotes, I think it might be fixed by passign the ENT_QUOTES flag to html_entity_decode

stof avatar Mar 18 '21 09:03 stof