SMF
Search issue - multi-byte words truncated
Description
If a post contains multi-byte terms, they are converted to HTML entities, which can take up to 8 or 9 bytes per character. The problem is that search words are truncated at 20 characters. Thus, any string of multi-byte characters can end up truncated in the middle of an HTML entity, e.g.:
Note the truncated HTML entities, e.g., &# and B. This is in the log_search_subjects table.
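The truncation can be reproduced outside SMF with a minimal sketch. This is not SMF's code: `to_entities` and `MAX_WORD_LENGTH` are hypothetical names, and the sketch assumes decimal numeric entities for every code point above U+FFFF (the 2.1 behaviour described in this report).

```python
# Hypothetical stand-in for the entity conversion; not SMF's actual function.
def to_entities(text, min_codepoint=0x10000):
    """Replace every code point >= min_codepoint with a decimal HTML entity."""
    return "".join(
        f"&#{ord(ch)};" if ord(ch) >= min_codepoint else ch
        for ch in text
    )

MAX_WORD_LENGTH = 20  # the 20-character limit described in this report

# Last word of the sample sentence (Gothic letters, 4 bytes each in UTF-8),
# written as escapes: U+10331 U+10342 U+10339 U+10332 U+10332 U+10339 U+10338
word = "\U00010331\U00010342\U00010339\U00010332\U00010332\U00010339\U00010338"

encoded = to_entities(word)             # 7 characters become 7 * 8 = 56 characters
truncated = encoded[:MAX_WORD_LENGTH]   # cut lands inside the third entity

print(len(encoded))   # 56
print(truncated)      # &#66353;&#66370;&#66
```

Only two of the seven characters survive intact; the third entity is cut after `&#66`, which matches the kind of fragment seen in log_search_subjects.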
This causes further issues down the road, e.g., when executing an HTML entity to UTF-8 conversion, you can get:
...as that word really isn't unique once it has been truncated.
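One way to see how uniqueness is lost: two different words that share their leading characters collapse to the same string once their entity-encoded form is cut at 20 characters. The sketch below only illustrates that collapse. The helper names are hypothetical and the entity format is an assumption, not SMF's implementation.

```python
MAX_WORD_LENGTH = 20  # the 20-character limit described in this report

# Hypothetical stand-in for the entity conversion; not SMF's actual function.
def to_entities(text, min_codepoint=0x10000):
    return "".join(
        f"&#{ord(ch)};" if ord(ch) >= min_codepoint else ch
        for ch in text
    )

def stored_form(word):
    """Entity-encode a word, then truncate it as the search index would."""
    return to_entities(word)[:MAX_WORD_LENGTH]

# Two distinct three-letter Gothic words that differ only in the last letter.
first = "\U00010331\U00010342\U00010339"   # ends in U+10339 -> &#66361;
second = "\U00010331\U00010342\U00010332"  # ends in U+10332 -> &#66354;

print(first == second)                            # False
print(stored_form(first) == stored_form(second))  # True
```

Both words are stored as `&#66353;&#66370;&#66`, so any step that expects one row per distinct word can no longer tell them apart.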
This issue exists in 2.0 as well.
In 2.1, this issue is restricted to 4-byte characters, as anything under 4 bytes is no longer converted to HTML entities, though such entities may be carried forward during an upgrade.
Steps to reproduce
- Post this: 𐌼𐌰𐌲 𐌲𐌻𐌴𐍃 𐌹̈𐍄𐌰𐌽, 𐌽𐌹 𐌼𐌹𐍃 𐍅𐌿 𐌽𐌳𐌰𐌽 𐌱𐍂𐌹𐌲𐌲𐌹𐌸.
- Look at log_search_subjects for that post
Environment (complete as necessary)
- Version/Git revision: current
- Database Type: mysql
- Database Version: 5.7
- PHP Version: 7.4
Additional information/references
4-byte characters are not common outside the use of emojis, certain symbols, and ancient texts....
But the SMF crowd is exactly the kinda crowd to use emojis, certain symbols, and ancient texts...
Based on the description, the issue exists only in MySQL.
In MySQL, a workaround could be to convert the database to mb4 and change the parameter in the Subs-Db-mysql.php file.
Correct, MySQL only. MB4 will require some db changes & increasing the minimum version of MySQL.
But mb4 is definitely the right direction.
What I mean is that SMF is already prepared for this, since I added this parameter some time ago, and it can be changed by hand: https://github.com/SimpleMachines/SMF2.1/blob/bb7f39482ba05b0d7d209f302e682d01e6a9fbd9/Sources/Subs-Db-mysql.php#L60