processwire-issues icon indicating copy to clipboard operation
processwire-issues copied to clipboard

Some extended page names get broken by punycode transformation

Open inreti-sb opened this issue 1 year ago • 27 comments

Short description of the issue

Some (at least german) special characters get broken in extended page names. Before being stored in the database, special characters are transformed to punycode. Looks like it is using some non-standard punycode to encode the names and this is where some characters seem to break.

Expected behavior

(Pagetitle → Pagename)

Fuß → fuß Füße → füße

Actual behavior

Fuß → s Füße → füsse

Optional: Screenshots/Links that demonstrate the issue

image png 388272219694a695ee29131ed8009c92

Processwire extended page name bug punycode in database

We verified the bug on different and fresh installs and discussed it with screenshots in this forum thread: https://processwire.com/talk/topic/30444-prepend-date-in-page-name-by-hook-creates-improper-results-on-some-existing-pages/

Optional: Suggestion for a possible fix

The Processwire function to convert unicode characters to ASCII characters seems to have a bug / be non-standard.

Steps to reproduce the issue

  1. Fresh Processwire install with blank profile
  2. Enable Extended Page Name Feature for UTF-8 support like here: https://processwire.com/blog/posts/page-name-charset-utf8/
  3. Create a new page with page title Fuß and save the page
  4. Look at the wrong generated pagename
  5. Compare the according punycode stored in in the database with the standard punycode by a IDN Punycode converter.

Setup/Environment

  • ProcessWire version: The current newest stable one 3.0.229
  • (Optional) PHP version: 8.2
  • (Optional) MariaDB version: 10.11.10
  • (Optional) Any 3rd party modules that are installed and could be related to the issue: Even on a fresh install with blank site profile.

inreti-sb avatar Dec 28 '24 17:12 inreti-sb

I can replicate it.

Interesting:

$p = $pages->add("test_template", "/", "Füße Füße"); $pages->uncacheAll(); d($p->name); // 'füße-füße'

$string="Köln"; d($sanitizer->pageName($string)); // 'kln' d($sanitizer->pageName($string, 16)); // 'köln' d($sanitizer->pageNameUTF8($string)); // 'köln'

Same with Fuß and Füße. 

In page editor it's correct, until you save:

image

after save:

image

matjazpotocnik avatar Dec 30 '24 12:12 matjazpotocnik

@inreti-sb @matjazpotocnik Thanks, I've pushed a fix for this. It wasn't in the Punycode library that we use, as that is only a fallback if PHP's IDN functions aren't available. And the Punycode library we are using appears to work correctly when forcing it to use it over PHP's IDN functions. It's the the PHP IDN functions that had changed behavior in some recent PHP version. PHP changed the default behavior of idn_to_ascii and idn_to_utf8, moving the previous behavior to a transitional flag argument. Using IDNs that aren't actual domains now requires that flag, so I have added it, which makes the functions work how we want them to again.

There was also another issue related to page names longer than 50 characters, which is longer than is supported in an IDN, so we have to split them up into 12 byte chunks and then only encode or decode the chunks that need it.

Can yo confirm it is also fixed there? Thanks.

ryancramerdesign avatar Dec 31 '24 16:12 ryancramerdesign

@ryancramerdesign I can confirm it's fixed. Thank you!

matjazpotocnik avatar Dec 31 '24 18:12 matjazpotocnik

Happy new year to everyone. Thanks for verifying @matjazpotocnik and thanks so much for the quick fix at the end of the year @ryancramerdesign!

Short pagename seem to work in my first tests, longer ones not really.

E.g.: Köln gegen Köln in Köln mit allen Köln Köln gegen Köln in Köln mit allen Köln → köln-gegen-kln-in-kln-mit-allen-köln-köln-gegen-köln-in-köln-mit-allen-kln/

Can someone verify this, please?

inreti-sb avatar Jan 01 '25 19:01 inreti-sb

You're right; it works on shorter strings but not on longer ones. It looks like 50 is the magic number:

Köln gegen Köln in Köln mit allen Köln Köln   -> 48 bytes, ok
Köln gegen Köln in Köln mit allen Köln KölnA  -> 49 bytes, ok
Köln gegen Köln in Köln mit allen Köln KölnAB -> 50 bytes, not ok
Köln gegen Köln in Köln mit allen Köln KolnAB -> 49 bytes, ok (replaced last ö with o)

matjazpotocnik avatar Jan 02 '25 07:01 matjazpotocnik

What I found:

  • It looks like idn_to_ascii works with a length of up to 64 bytes, at least in PHP 8.2
  • Max IDN length is 255 bytes encoded
  • Fallback library (very old) works as expected :-) no need for splitting the string. There are some deprecations, but there is a fix (line 154 and 158):
154       $code = $t + (((int) $q - $t) % (static::BASE - $t)); //MP
155       $output .= static::$encodeTable[$code];
156       $q = ($q - $t) / (static::BASE - $t);
157    }
158    $output .= static::$encodeTable[(int) $q]; //MP
  • idn_to_ascii("öln-in-köln-", 16) return false! error IDNA_ERROR_TRAILING_HYPHEN (16)
  • idn_to_ascii("0---0.dk") return false! error: IDNA_ERROR_HYPHEN_3_4 (32) - UTS#46 does not allow labels with hyphens in the 3rd and 4th positions.
  • and so on...

What to do? We could call idn_to_ascii("öln-in-köln-", 16, INTL_IDNA_VARIANT_UTS46, $info); and use $info['result']? Or use fallback library?

matjazpotocnik avatar Jan 02 '25 15:01 matjazpotocnik

@matjazpotocnik Those sound like good solutions. I'm not sure how best to handle this case though. If the output of the punyEncodeName() changes, then UTF-8 pages that had their names encoded with it after PHP 7.4, might be incorrect. But correcting it means those pages will no longer be accessible by URL, because the URL might very well not be a correct conversion, but it's what's already being used and saved in the DB. So I think I may have to revert back to the original version before the fix, and then use the new/corrected version for new installations after January 2 2025. I'm also thinking that if you set your $config->pageNameCharset to "UTF-8" rather than "UTF8" then it would use the new/correct version. Then I could update any instructions to refer to "UTF-8", but existing installations could continue as-is. It would be great if PHP hadn't changed the behavior of the function in 7.4 but since they did, I just need to figure out the best way to fix it without affecting existing installations.

ryancramerdesign avatar Jan 02 '25 20:01 ryancramerdesign

I don't know what the best solution would be. Wouldn't changing from "UTF8" to "UTF-8" be confusing? After reading the comment by Bruce MacNaughton at https://processwire.com/blog/posts/page-name-charset-utf8/, maybe we could use a different name? But then again, that might be even more confusing. Whatever solution would be acceptable if it works correctly and the documentation is in place.

matjazpotocnik avatar Jan 02 '25 20:01 matjazpotocnik

My thought is it wouldn't be confusing because we wouldn't be telling anyone to use UTF8. It would just be a way for existing installations to be untouched, with no need to even know about UTF-8 vs UTF8. It could be something else, but I don't see any reason to have it be an "either or" option since there would be no reason to have the old/broken behavior except if already stuck with it.

On Thu, Jan 2, 2025, 3:37 PM Matjaž Potočnik @.***> wrote:

I don't know what the best solution would be. Wouldn't changing from "UTF8" to "UTF-8" be confusing? After reading the comment by Bruce MacNaughton at https://processwire.com/blog/posts/page-name-charset-utf8/, maybe we could use a different name? But then again, that might be even more confusing. Whatever solution would be acceptable if it works correctly and the documentation is in place.

— Reply to this email directly, view it on GitHub https://github.com/processwire/processwire-issues/issues/2015#issuecomment-2568347810, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACQEUB53MLV4QASGC2IP7T2IWPP7AVCNFSM6AAAAABUJ6VOZOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRYGM2DOOBRGA . You are receiving this because you were mentioned.Message ID: @.***>

ryancramerdesign avatar Jan 02 '25 20:01 ryancramerdesign

Installs after Jan 2 2025 would use the fixed behavior either way, UTF8 or UTF-8 or whatever alternative name is used.

On Thu, Jan 2, 2025, 3:42 PM Ryan Cramer @.***> wrote:

My thought is it wouldn't be confusing because we wouldn't be telling anyone to use UTF8. It would just be a way for existing installations to be untouched, with no need to even know about UTF-8 vs UTF8. It could be something else, but I don't see any reason to have it be an "either or" option since there would be no reason to have the old/broken behavior except if already stuck with it.

On Thu, Jan 2, 2025, 3:37 PM Matjaž Potočnik @.***> wrote:

I don't know what the best solution would be. Wouldn't changing from "UTF8" to "UTF-8" be confusing? After reading the comment by Bruce MacNaughton at https://processwire.com/blog/posts/page-name-charset-utf8/, maybe we could use a different name? But then again, that might be even more confusing. Whatever solution would be acceptable if it works correctly and the documentation is in place.

— Reply to this email directly, view it on GitHub https://github.com/processwire/processwire-issues/issues/2015#issuecomment-2568347810, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACQEUB53MLV4QASGC2IP7T2IWPP7AVCNFSM6AAAAABUJ6VOZOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRYGM2DOOBRGA . You are receiving this because you were mentioned.Message ID: @.***>

ryancramerdesign avatar Jan 02 '25 20:01 ryancramerdesign

@matjazpotocnik @inreti-sb I've pushed another update that makes it use the old/buggy behavior for existing installations and the fixed behavior for new installations (after tomorrow). I decided not to do the UTF-8 vs UTF8 toggle because it was too many lines of code to change and I'm feeling too lazy for that. Instead, you can force the behavior by prepending it to your $config->pageNameWhitelist as v1, v2 or v3. The v1 is the old/buggy behavior in PHP 7.4+, the v2 is the dedicated Punycode library, and the v3 is fixed behavior in PHP 7.4+.

$config->pageNameWhitelist = 'v3' . $config->pageNameWhitelist; 

This is just to ensure there's a way for existing installations to use the fixed behavior, should they not have concerns about existing page that might already in the DB.

ryancramerdesign avatar Jan 03 '25 17:01 ryancramerdesign

Thank you @ryancramerdesign and @matjazpotocnik for making this important fix possible quickly! In my first tests the fix seem to work perfectly.

inreti-sb avatar Jan 03 '25 22:01 inreti-sb

$s="öln-in-köln-";
$a=$sanitizer->punyEncodeName($s,3); // return 'ln-in-kln-' WRONG
$b=$sanitizer->punyDecodeName($a,3); // return 'ln-in-kln-' WRONG

this line is the problem:

if(strlen($_value) >= 50) $value = $info['result'];

IDN return false even on shorter strings, as the input value is incorrect. So, above line should be

if($value === false) $value = $info['result']; or if(empty($value)) $value = $info['result']; or simply $value = $info['result'];

matjazpotocnik avatar Jan 04 '25 09:01 matjazpotocnik

Thanks @matjazpotocnik I've updated it to always use $info[result]

ryancramerdesign avatar Jan 05 '25 15:01 ryancramerdesign

Don't know if it's related to this bug and fix, but ! (exclamation marks) are sometimes substituted with - (dashes) although dashes are in the pageNameWhitelist. E.g.:

Werde Teil des Projekts! → werde-teil-des-projekts-
Schluss mit diesem Blödsinn! Gegen Spiele! → schluss-mit-diesem-blödsinn!-gegen-spiele!

inreti-sb avatar Jan 07 '25 07:01 inreti-sb

I tried and got:

Werde Teil des Projekts! → werde-teil-des-projekts Schluss mit diesem Blödsinn! Gegen Spiele! → schluss-mit-diesem-blödsinn-gegen-spiele

matjazpotocnik avatar Jan 07 '25 20:01 matjazpotocnik

Did you added ! in the whitelist, @matjazpotocnik ? :)

inreti-sb avatar Jan 07 '25 20:01 inreti-sb

If i add " (customs mark) to the pageNameWhitelist all this characters get substituted by quot (with a dash). "Foo" → quot-foo-quot

Maybe it is helpful - if not already the case - to set up tests to check the correct pagenames (output) for a list of pagetitles (input)?

inreti-sb avatar Jan 08 '25 14:01 inreti-sb

Maybe it is helpful - if not already the case - to set up tests to check the correct pagenames (output) for a list of pagetitles (input)?

That would be an extremely useful step imho! Please make that happen :)

BernhardBaumrock avatar Jan 08 '25 14:01 BernhardBaumrock

@inreti-sb You maybe can take my special cases I've tested as a starting point. They should definitely work now. Let me know if I can help.

poljpocket avatar Jan 08 '25 14:01 poljpocket

@inreti-sb The double quote is part of the page name blacklist, so isn't supported even if part of your whitelist. Here's the blacklist:

$blacklist = '/\\%"\'<>?#@:;,+=*^$()[]{}|&';

While exclamation point isn't part of our blacklist, google seems to suggest it isn't supported in Punycode, so maybe it should be added to the blacklist, I'm not sure, I have to test it out. I did just push some other improvements also.

ryancramerdesign avatar Jan 08 '25 15:01 ryancramerdesign

@inreti-sb I didn't add ! to the whitelist as I think these characters are not allowed:

(){}[]|`¬¦! "£$%^&*"<>:;#~_-+=,@

matjazpotocnik avatar Jan 08 '25 16:01 matjazpotocnik

If i add " (customs mark) to the pageNameWhitelist all this characters get substituted by quot (with a dash). "Foo" → quot-foo-quot

I do not yet know the cause, but this is only happening on pagenames generated by my hook here: https://processwire.com/talk/topic/30444-prepend-date-in-page-name-by-hook-creates-improper-results-on-some-existing-pages/ Will do more research and perhaps ask for help and write the solution there.

inreti-sb avatar Jan 09 '25 12:01 inreti-sb

Now I use the new Sanitizer.php version but still get:

✗ I have a dream! → i-have-a-dream-
✗ Was! folgt? → was--folgt-
✓ Was! folgt? nach 100 € Katar? → was-folgt-nach-100-katar

I would expect that not allowed characters should be removed. On shorter pagenames this seems not do be the case.

inreti-sb avatar Jan 09 '25 12:01 inreti-sb

And I would expect and prefer to only have one dash here. What do you think?

✗ This is the dash - test → this-is-the-dash---test

inreti-sb avatar Jan 09 '25 13:01 inreti-sb

Hi,

with Processwire 3.0.246 and PHP 8.2 I still have bad behavior with all "v" variants.

I have

$config->pageNameCharset = 'UTF8';
$config->pageNameWhitelist = '-_.abcdefghijklmnopqrstuvwxyz0123456789æåäßöüđжхцчшщюяàáâèéëêěìíïîõòóôøùúûůñçčćďĺľńňŕřšťýžабвгдеёзийклмнопрстуфыэęąśłżź';

When I create a new page using the "Discussion2" module, which lets processwire auto-generate the page name from the title, and the title is

Texte pour le 17.02.25 également

The new page cannot be shown using its url. Depending on the mangling of the pageNameWhitelist with the "v" flag, I see the following requests in firefox (Response code, method, URL):

No v flag: 301 POST http://localhost/bmw/forum/forum-244k76406/?new=1
301 GET http://localhost/bmw/forum/forum-244k76406/texte-pour-le-17.02.25-%C3%A9galement-22/ 404 GET http://localhost/bmw/forum/forum-244k76406/te-pour-le-17.02.xn--25-galement-22-dkb/

v1: 301 POST http://localhost/bmw/forum/forum-244k76406/?new=1
301 GET http://localhost/bmw/forum/forum-244k76406/texte-pour-le-17.02.25-%C3%A9galement-23/ 404 GET http://localhost/bmw/forum/forum-244k76406/te-pour-le-17.02.xn--25-galement-23-dkb/ (badly mangled url)

v2: 301 POST http://localhost/bmw/forum/forum-244k76406/?new=1
301 GET http://localhost/bmw/forum/forum-244k76406/texte-pour-le-17.02.25-%C3%A9galement-2/ 404 GET http://localhost/bmw/forum/forum-244k76406/texte-%C3%B4pour-le.02.25-%C3%A9galement-2/ (somehow inserted character after "texte-")

v3: 301 POST http://localhost/bmw/forum/forum-244k76406/?new=1
301 GET http://localhost/bmw/forum/forum-244k76406/texte-pour-le-17.02.25-%C3%A9galement-3/ 404 GET http://localhost/bmw/forum/forum-244k76406/texte-pour-le-17-.02.25-%C3%A9galement-3/ (additional "-" appearing after "17")

Regards, moritz

moritzboth avatar Feb 18 '25 15:02 moritzboth

Just ran into similar issues here. Don't currently have a good test case as I can't reproduce it myself, but it does seem that there are cases where URLs with some characters don't work (at least when converted from title). Strangely enough just re-saving the page (at least as a superuser) fixed this.

(This was in PHP 7.4 and PW 2.4.6.)

teppokoivula avatar May 09 '25 07:05 teppokoivula