Slugs/URIs with Accented Characters Cause Inconsistent URLs and 404 Errors (French Language)

Open AlexTr opened this issue 7 months ago • 1 comments

Problem Description

When the platform is used in French, page titles, profile names, and other content often contain accented characters (e.g., é, à, ô, ç). The current slug/URI generation logic does not transliterate these characters, which leads to two major issues:

  1. Inconsistent URLs:
     • Sometimes, URLs containing accents are accessible (e.g., /page/élève).
     • Other times, the URLs become extremely long or malformed (due to double encoding or browser misinterpretation), resulting in 404 errors.
  2. User Experience:
     • Users expect URLs to be clean and accessible regardless of accents.
     • Browsers and crawlers may fail to resolve URLs with non-ASCII characters, impacting SEO and usability.
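To illustrate how an accented slug can turn into a long, unresolvable URL, here is a minimal sketch of percent-encoding applied once and then twice. It only uses PHP's built-in rawurlencode/rawurldecode; the variable names are illustrative and the source file is assumed to be saved as UTF-8. Where exactly the second encoding happens in a given request chain is not established here.

```php
<?php
// Illustrative slug containing accents (UTF-8 source file assumed).
$sSlug = 'élève';

// Encoded once: each accented character becomes two percent-encoded bytes.
$sOnce = rawurlencode($sSlug);   // %C3%A9l%C3%A8ve

// Encoded again by mistake: every "%" itself becomes "%25".
$sTwice = rawurlencode($sOnce);  // %25C3%25A9l%25C3%25A8ve

// A single decode of the double-encoded value does not recover the slug,
// so a lookup by the original slug would miss.
$bRecovered = rawurldecode($sTwice) === $sSlug;  // false

echo $sOnce, "\n", $sTwice, "\n";
```

An ASCII-only slug is unchanged by rawurlencode, so it cannot be corrupted this way.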

Technical Analysis

The root cause is in the uriFilter function in utils.inc.php:

    function uriFilter ($s, $aParams = [])
    {
        $sEmpty = isset($aParams['empty']) ? $aParams['empty'] : '-';
        $sDivider = isset($aParams['divider']) ? $aParams['divider'] : '-';

        if(BxTemplConfig::getInstance()->bAllowUnicodeInPreg)
            $s = get_mb_replace ('/[^\pL^\pN^_]+/u', $sDivider, $s); // unicode characters
        else
            $s = get_mb_replace ('/([^\d^\w]+)/u', $sDivider, $s); // latin characters only

        $s = get_mb_replace ('/([' . $sDivider . '^]+)/', $sDivider, $s);
        $s = get_mb_replace ('/([' . $sDivider . ']+)$/', '', $s); // remove trailing dash
        if(!$s)
            $s = $sEmpty;

        return !isset($aParams['lowercase']) || $aParams['lowercase'] === true ? mb_strtolower($s) : $s;
    }

Issue: There is no transliteration step to convert accented characters to their ASCII equivalents (e.g., é → e). As a result, slugs may contain raw UTF-8 characters, which are not always handled consistently by browsers or web servers.
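The behavior described above can be reproduced outside the CMS with a small standalone sketch. It mirrors only the Unicode branch of uriFilter and makes two assumptions: preg_replace stands in for UNA's get_mb_replace helper, and the function name sketchUriFilterUnicode is hypothetical. The source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical standalone copy of the Unicode branch of uriFilter.
// Assumption: get_mb_replace behaves like preg_replace for these patterns.
function sketchUriFilterUnicode(string $s, string $sDivider = '-'): string
{
    // Replace runs of characters that are not letters, numbers, "^" or "_".
    $s = preg_replace('/[^\pL^\pN^_]+/u', $sDivider, $s);
    // Collapse repeated dividers, then remove a trailing divider.
    $s = preg_replace('/([' . $sDivider . '^]+)/', $sDivider, $s);
    $s = preg_replace('/([' . $sDivider . ']+)$/', '', $s);

    return mb_strtolower($s, 'UTF-8');
}

// \pL matches accented letters, so they pass through untouched.
$sSlug = sketchUriFilterUnicode('Élève très appliqué');
echo $sSlug, "\n";                // élève-très-appliqué
echo rawurlencode($sSlug), "\n";  // %C3%A9l%C3%A8ve-tr%C3%A8s-appliqu%C3%A9
```

The slug keeps its accents, so any component that percent-encodes the path produces the longer form shown on the last line.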

Steps to Reproduce

  1. Set UNA CMS to French.
  2. Create a profile or page with a name/title containing accents (e.g., Élève très appliqué).
  3. Observe the generated URL:
     • Sometimes it works (with accents in the URL).
     • Sometimes it results in a 404 or a very long, encoded URL.

Proposed Patch

Add a transliteration step using iconv at the start of the uriFilter function:

    function uriFilter ($s, $aParams = [])
    {
        // Transliterate accented characters to ASCII.
        // iconv() returns false on failure, so keep the original string in that case.
        $sAscii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $s);
        if($sAscii !== false)
            $s = $sAscii;

        $sEmpty = isset($aParams['empty']) ? $aParams['empty'] : '-';
        $sDivider = isset($aParams['divider']) ? $aParams['divider'] : '-';

        if(BxTemplConfig::getInstance()->bAllowUnicodeInPreg)
            $s = get_mb_replace ('/[^\pL^\pN^_]+/u', $sDivider, $s); // unicode characters
        else
            $s = get_mb_replace ('/([^\d^\w]+)/u', $sDivider, $s); // latin characters only

        $s = get_mb_replace ('/([' . $sDivider . '^]+)/', $sDivider, $s);
        $s = get_mb_replace ('/([' . $sDivider . ']+)$/', '', $s); // remove trailing dash
        if(!$s)
            $s = $sEmpty;

        return !isset($aParams['lowercase']) || $aParams['lowercase'] === true ? mb_strtolower($s) : $s;
    }

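This converts accented characters to their closest ASCII equivalents before further processing. Two caveats: the output of //TRANSLIT depends on the iconv implementation and the active locale, and an ASCII-only conversion also drops or replaces characters from non-Latin scripts, so the Unicode branch would no longer preserve them.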
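Because iconv's //TRANSLIT output varies between platforms, one deterministic alternative is an explicit lookup table applied with strtr. The sketch below is not part of UNA: the function name transliterateFrench and the character map are hypothetical, the map covers only common French letters in precomposed (NFC) form, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: map common French accented letters to ASCII.
// Characters not listed here (other languages, decomposed forms) pass through unchanged.
function transliterateFrench(string $s): string
{
    static $aMap = [
        'à' => 'a', 'â' => 'a', 'ä' => 'a', 'ç' => 'c',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
        'î' => 'i', 'ï' => 'i', 'ô' => 'o', 'ö' => 'o',
        'ù' => 'u', 'û' => 'u', 'ü' => 'u', 'ÿ' => 'y',
        'œ' => 'oe', 'æ' => 'ae',
        'À' => 'A', 'Â' => 'A', 'Ä' => 'A', 'Ç' => 'C',
        'É' => 'E', 'È' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'Î' => 'I', 'Ï' => 'I', 'Ô' => 'O', 'Ö' => 'O',
        'Ù' => 'U', 'Û' => 'U', 'Ü' => 'U', 'Ÿ' => 'Y',
        'Œ' => 'OE', 'Æ' => 'AE',
    ];

    // strtr with an array replaces each key independently of locale settings.
    return strtr($s, $aMap);
}

echo transliterateFrench('Élève très appliqué'), "\n"; // Eleve tres applique
echo transliterateFrench('Cœur de garçon'), "\n";      // Coeur de garcon
```

A table like this trades coverage for predictability: it yields the same slug on every server, but each additional language needs its own entries.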

Expected Result

• URLs are always ASCII-only, clean, and accessible.
• No more 404 errors or browser misinterpretation due to accents.
• Improved SEO and user experience.

AlexTr, May 22 '25 08:05